SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Foundations of metric space searching
1. distance searching problem in metric spaces
2. metric distance measures
3. similarity queries
4. basic partitioning principles
5. principles of similarity query execution
6. policies to avoid distance computations
7. metric space transformations
8. principles of approximate similarity search
9. advanced issues

1. Distance searching problem

Search problem:
- data type
- the method of comparison
- query formulation

Extensibility: a single indexing technique applicable to many specific search problems that are quite different in nature.

Traditional search:
- exact (partial, range) retrieval
- sortable domains of data (numbers, strings)

Perspective search:
- proximity, similarity, dissimilarity, distance
- domains that are not sortable (Hamming distance, color histograms)

Definition (divide and conquer): let D be a domain and d a distance measure on objects from D. Given a set X ⊆ D of n elements, preprocess or structure the data so that proximity queries are answered efficiently.

Metric space as a similarity-search abstraction:
- distances are used for searching
- no coordinates, hence no data-space partitioning
- vector versus metric spaces

Three reasons for metric indexes:
- no other possibility
- comparable performance for special cases
- high extensibility

Metric space
M = (D,d)
- data domain D
- total (distance) function d: D × D → ℝ (the metric function, or metric)

The metric space postulates:
- non-negativity: ∀x,y ∈ D: d(x,y) ≥ 0
- symmetry: ∀x,y ∈ D: d(x,y) = d(y,x)
- identity: ∀x,y ∈ D: d(x,y) = 0 ⇔ x = y
- triangle inequality: ∀x,y,z ∈ D: d(x,z) ≤ d(x,y) + d(y,z)

Another specification:
- (p1) non-negativity: ∀x,y: d(x,y) ≥ 0
- (p2) symmetry: ∀x,y: d(x,y) = d(y,x)
- (p3) reflexivity: ∀x: d(x,x) = 0
- (p4) positiveness: ∀x ≠ y: d(x,y) > 0
- (p5) triangle inequality: ∀x,y,z: d(x,z) ≤ d(x,y) + d(y,z)

Pseudo metric
- property (p4) does not hold
- if all objects at distance 0 are treated as a single object, we get a metric space
- to be proved: d(x,y) = 0 ⇒ ∀z: d(x,z) = d(y,z)
- since d(x,z) ≤ d(x,y) + d(y,z) = d(y,z) and d(y,z) ≤ d(y,x) + d(x,z) = d(x,z),
- we get d(x,z) = d(y,z)
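As an illustrative aside (not from the book), the postulates can be tested empirically on a finite sample of objects. A minimal Python sketch, assuming a user-supplied distance function d:

```python
from itertools import product

def check_metric_postulates(d, sample):
    """Empirically test the metric postulates on a finite sample of objects.
    Passing the test does not prove d is a metric; failing it disproves it."""
    for x, y in product(sample, repeat=2):
        assert d(x, y) >= 0, "non-negativity violated"
        assert d(x, y) == d(y, x), "symmetry violated"
        assert (d(x, y) == 0) == (x == y), "identity violated"
    for x, y, z in product(sample, repeat=3):
        assert d(x, z) <= d(x, y) + d(y, z), "triangle inequality violated"

# Example: the Hamming distance on fixed-length strings is a metric.
hamming = lambda a, b: sum(c1 != c2 for c1, c2 in zip(a, b))
check_metric_postulates(hamming, ["abc", "abd", "xbc", "xyz"])
```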
Quasi metric
- property (p2), symmetry, does not hold; e.g., locations in a city with one-way streets
- transformation to a metric space: d_sym(x,y) = d(x,y) + d(y,x)

Super metric
- also called the ultra metric
- stronger constraint on (p5): d(x,z) ≤ max{ d(x,y), d(y,z) }
- every triangle has at least two sides of equal length, i.e., it is isosceles
- used in evolutionary biology (phylogenetic trees)

2. Metric distance measures

Distance measures
- discrete: functions that return only a small (predefined) set of values
- continuous: functions for which the cardinality of the set of returned values is very large or infinite

Minkowski distances
- also called the L_p metrics, defined on n-dimensional vectors:
  L_p(x,y) = ( Σ_{i=1..n} |x_i − y_i|^p )^{1/p}, p ≥ 1

Special cases:
- L_1 – Manhattan (city-block) distance
- L_2 – Euclidean distance
- L_∞ – maximum (infinity) distance: L_∞(x,y) = max_i |x_i − y_i|

Quadratic form distance
- handles correlated dimensions ("cross-talk"), e.g., in color histograms:
  d_M(x,y) = sqrt( (x − y)^T · M · (x − y) )
- M is a positive semi-definite n × n matrix
- if M = diag(w_1,…,w_n), d_M becomes the weighted Euclidean distance

Example: 3-dimensional vectors over (blue, red, orange) color channels:
- pure red: (0,1,0); pure orange: (0,0,1); pure blue: (1,0,0)
- under the Euclidean distance, the blue and orange images are equidistant from the red one (both at √2)

Example (continued): human color perception says red and orange are more alike than red and blue. This can be encoded in the matrix, e.g., by setting the red–orange cross-talk entries to 0.9:

  M = ( 1.0  0.0  0.0
        0.0  1.0  0.9
        0.0  0.9  1.0 )

- the distance of red and orange is then √0.2 ≈ 0.45
- the distance of red and blue remains √2 ≈ 1.41

Edit distance
- also called the Levenshtein distance: the minimum number of atomic operations needed to transform string x into string y
  - insert character c into string x at position i
  - delete the character at position i in string x
  - replace the character at position i in x with c

Edit distance – weights
- if the weights (costs) of the insert and delete operations differ, the edit distance is not symmetric (a weighted implementation is sketched below)
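A minimal dynamic-programming sketch of the weighted edit distance (my illustration; the weights below reproduce the example that follows):

```python
def edit_distance(x, y, w_insert=1, w_delete=1, w_replace=1):
    """Weighted Levenshtein distance: minimum total cost of insertions,
    deletions, and replacements transforming string x into string y."""
    m, n = len(x), len(y)
    # D[i][j] = cost of transforming x[:i] into y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * w_delete               # delete all of x[:i]
    for j in range(1, n + 1):
        D[0][j] = j * w_insert               # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            repl = 0 if x[i - 1] == y[j - 1] else w_replace
            D[i][j] = min(D[i - 1][j] + w_delete,   # delete x[i-1]
                          D[i][j - 1] + w_insert,   # insert y[j-1]
                          D[i - 1][j - 1] + repl)   # match / replace
    return D[m][n]

# Asymmetry when insert and delete costs differ:
print(edit_distance("combine", "combination", w_insert=2, w_delete=1))  # 9
print(edit_distance("combination", "combine", w_insert=2, w_delete=1))  # 5
```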
Example: w_insert = 2, w_delete = 1, w_replace = 1
- d_edit("combine", "combination") = 9: replacement e→a, insertion of t,i,o,n
- d_edit("combination", "combine") = 5: replacement a→e, deletion of t,i,o,n

Edit distance – generalizations
- the replacement cost can differ per character pair: a→b may cost differently than a→c
- if the costs are symmetric (a→b costs the same as b→a), the function is still a metric
- edit distance can be generalized to tree structures

Jaccard's coefficient
- distance measure for sets A and B:
  d(A,B) = 1 − |A ∩ B| / |A ∪ B|
- Tanimoto similarity for vectors:
  TS(x,y) = (x · y) / (‖x‖² + ‖y‖² − x · y),
  where x · y is the scalar product and ‖·‖ is the Euclidean norm

Hausdorff distance
- distance measure for sets whose elements are compared by a distance d_e:
  d_H(A,B) = max{ sup_{a∈A} inf_{b∈B} d_e(a,b), sup_{b∈B} inf_{a∈A} d_e(a,b) }
- measures the extent to which each point of the "model" set A lies near some point of the "image" set B, and vice versa
- two sets are within Hausdorff distance r from each other if and only if any point of one set is within distance r from some point of the other set

3. Similarity queries

Similarity queries
- range query
- nearest neighbor query
- reverse nearest neighbor query
- similarity join
- combined queries
- complex queries

Similarity range query
- R(q,r) = { x ∈ X | d(q,x) ≤ r }
- "all museums up to 2 km from my hotel"

Nearest neighbor query
- NN(q) = x, where x ∈ X and ∀y ∈ X: d(q,x) ≤ d(q,y)
- k-nearest neighbor query: k-NN(q) = A, where A ⊆ X, |A| = k, and ∀x ∈ A, y ∈ X − A: d(q,x) ≤ d(q,y)
- "the five closest museums to my hotel"

Reverse nearest neighbor query
- k-RNN(q) = { x ∈ X | q is among the k nearest neighbors of x }
- "all hotels with a specific museum as the nearest cultural heritage site"
- example of 2-RNN: objects o4, o5, and o6 have q among their two nearest neighbors

Similarity join
- similarity join of two data sets X and Y with threshold μ:
  J(X,Y,μ) = { (x,y) ∈ X × Y | d(x,y) ≤ μ }
- similarity self join ⇔ X = Y
- "pairs of hotels and museums that are a five-minute walk apart"

Combined queries
- range query + nearest neighbors: k-NN(q) restricted to objects within distance r
- nearest neighbor + similarity joins: defined by analogy

Complex queries
- e.g., find the best matches of circular-shape objects with red color
- the best match for circular shape or for red color alone need not be the best combined match!
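Before turning to query evaluation algorithms, here is a minimal sketch (my illustration, not from the book) of the basic query types evaluated by sequential scan over a list of database objects:

```python
def range_query(X, d, q, r):
    """R(q,r): all objects within distance r of q."""
    return [o for o in X if d(q, o) <= r]

def knn_query(X, d, q, k):
    """k-NN(q): the k objects closest to q (ties broken arbitrarily)."""
    return sorted(X, key=lambda o: d(q, o))[:k]

def similarity_join(X, Y, d, mu):
    """J(X,Y,mu): all cross pairs within distance mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

# Example with 1-D points and absolute difference as the metric:
d = lambda a, b: abs(a - b)
X = [1, 4, 7, 10]
print(range_query(X, d, 5, 2))           # [4, 7]
print(knn_query(X, d, 5, 2))             # [4, 7]
print(similarity_join(X, [2, 9], d, 1))  # [(1, 2), (10, 9)]
```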
The A0 algorithm (for complex queries)
- for each predicate i:
  - objects are delivered in decreasing similarity
  - sets X_i of best matches are built incrementally until the intersection of all the X_i contains at least k objects
- for all objects seen:
  - consider all query predicates
  - establish the final rank (fuzzy algebra, weighted sets, etc.)

4. Basic partitioning principles

Partitioning principles
Given a set X ⊆ D in M = (D,d), three basic partitioning principles have been defined:
- ball partitioning
- generalized hyper-plane partitioning
- excluded middle partitioning

Ball partitioning (with pivot p and radius d_m)
- inner set: { x ∈ X | d(p,x) ≤ d_m }
- outer set: { x ∈ X | d(p,x) > d_m }

Multi-way ball partitioning (with radii d_m1 < d_m2)
- inner set: { x ∈ X | d(p,x) ≤ d_m1 }
- middle set: { x ∈ X | d(p,x) > d_m1 ∧ d(p,x) ≤ d_m2 }
- outer set: { x ∈ X | d(p,x) > d_m2 }

Generalized hyper-plane partitioning (with pivots p1, p2)
- { x ∈ X | d(p1,x) ≤ d(p2,x) }
- { x ∈ X | d(p1,x) > d(p2,x) }

Excluded middle partitioning (with pivot p, radius d_m, and exclusion-zone thickness 2ρ)
- inner set: { x ∈ X | d(p,x) ≤ d_m − ρ }
- outer set: { x ∈ X | d(p,x) > d_m + ρ }
- excluded set: otherwise
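A minimal sketch of the three partitioning principles as Python functions (illustration only; the pivots, d_m, and ρ are parameters chosen by an index structure):

```python
def ball_partition(X, d, p, dm):
    """Split X into inner/outer sets by distance to pivot p."""
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer

def hyperplane_partition(X, d, p1, p2):
    """Split X by the generalized hyper-plane between pivots p1 and p2."""
    left  = [x for x in X if d(p1, x) <= d(p2, x)]
    right = [x for x in X if d(p1, x) >  d(p2, x)]
    return left, right

def excluded_middle_partition(X, d, p, dm, rho):
    """Ball partitioning with a 2*rho-thick exclusion zone around dm."""
    inner    = [x for x in X if d(p, x) <= dm - rho]
    outer    = [x for x in X if d(p, x) >  dm + rho]
    excluded = [x for x in X if dm - rho < d(p, x) <= dm + rho]
    return inner, outer, excluded
```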
5. Principles of similarity query execution

Basic strategies
- the cost of answering a query is influenced by the partitioning principle and the query execution algorithm
- sequential organization & range query R(q,r):
  - all database objects are scanned consecutively and d(q,o) is evaluated for each
  - whenever d(q,o) ≤ r, o is reported in the result
  - e.g., R(q,4) over objects at distances 3, 10, 8, 1, … reports the objects at distances 3 and 1

Basic strategies (cont.)
- sequential organization & k-NN query, e.g., 3-NN(q):
  - initially, take the first k objects and order them by distance from q
  - all remaining objects are scanned consecutively and d(q,o) is evaluated
  - if d(q,o_i) ≤ d(q,o_k), o_i is inserted at the correct position in the answer and the current last neighbor o_k is dropped

Hypothetical index organization
- a hierarchy of entries (nodes) N = (G, R(G)):
  - G = { e | e is an object or e is another entry }
  - the bounding region R(G) covers all elements of G; e.g., a ball region with pivot p and radius r: ∀o: d(o,p) ≤ r
  - each element belongs to exactly one G
  - there is one root entry G
- any similarity query Q returns a set of objects
  - we can define R(Q), which covers all objects in the response

Example of index organization using ball regions:
- the root node B1 organizes four objects and two ball regions
- the child ball regions B2 and B3 contain two and three objects, respectively

Range search algorithm
Given Q = R(q,r):
- start at the root
- in the current node N = (G, R(G)), process all elements:
  - object element o_j ∈ G: if d(q,o_j) ≤ r, report o_j to the output
  - non-object element N' = (G', R(G')) ∈ G: if R(G') and R(Q) intersect, recursively search in N'

Range search example for R(q,r):
- start inspecting the elements in B1
- B3 is not intersected, so it is skipped
- inspect the elements in B2
- the search is complete; response = { o8, o9 }

Nearest neighbor search algorithm
- no query radius is given: we do not know the distance to the k-th nearest neighbor in advance
- to allow filtering of unnecessary branches, the query radius is defined as the distance to the current k-th neighbor
- a priority queue PR is maintained:
  - it contains the regions that may include objects relevant to the query
  - the regions are sorted by decreasing relevance

NN search algorithm (cont.)
Given Q = k-NN(q), the assumptions are:
- the query region R(Q) is limited by the distance r to the current k-th neighbor in the response
- whenever PR is updated, its entries are sorted by decreasing proximity to q
- objects in the response are sorted by increasing distance to q; the response contains at most k objects
Initialization:
- put the root node into PR
- pick k database objects at random and insert them into the response

While PR is not empty, repeat:
- pick an entry N = (G, R(G)) from PR
- for each object element o_j ∈ G: if d(q,o_j) ≤ r, add o_j to the response and update r and R(Q)
- remove entries from PR that can no longer intersect the query
- for each non-object element N' = (G', R(G')) ∈ G: if R(G') and R(Q) intersect, insert N' into PR
The response then contains the k nearest neighbors of q.

Example of 3-NN(q) on the tree above:
- pick three objects at random as the initial response
- processing B1 refines the response using the root objects o1…o4
- B3 is skipped
- processing B2 adds o8 and o9; the final response is o8, o9, o1
- PR is empty, quit
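Before moving to the incremental variant, the hierarchical range search described above can be sketched compactly (illustration under simplifying assumptions: a node is either a plain object or a ball with pivot, radius, and children; the ball–query intersection test is d(q,p) ≤ r_ball + r_query):

```python
class Ball:
    def __init__(self, pivot, radius, children):
        self.pivot, self.radius, self.children = pivot, radius, children

def range_search(node, d, q, r, result):
    """Recursive range search: descend only into balls intersecting R(q,r)."""
    for child in node.children:
        if isinstance(child, Ball):
            # Prune the subtree unless the ball can intersect the query ball.
            if d(q, child.pivot) <= child.radius + r:
                range_search(child, d, q, r, result)
        else:  # child is a plain database object
            if d(q, child) <= r:
                result.append(child)
    return result

# 1-D example: a root ball containing two child balls and one object.
d = lambda a, b: abs(a - b)
tree = Ball(5, 10, [Ball(2, 2, [1, 2, 3]), Ball(9, 2, [8, 9, 10]), 5])
print(range_search(tree, d, 8.5, 1, []))  # [8, 9] -- the left ball is pruned
```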
Incremental similarity search
The hypothetical index structure is slightly modified:
- elements of type 0 are objects e0
- elements e1 are ball regions containing only objects, i.e., elements e0 (e.g., B2, B3)
- elements e2 contain elements e0 and e1 (e.g., B1)
- each element type has an associated distance function from the query object q:
  - d0(q,e0) for objects
  - dt(q,et) for elements of type t; e.g., dt(q,et) = d(q,p) − r for a ball et with pivot p and radius r (clamped at 0 when q lies inside the ball)
- for correctness, dt(q,et) must lower-bound the distance to any object e0 in et: dt(q,et) ≤ d0(q,e0)

Incremental NN search
- based on a priority queue PR again:
  - each element et in PR also carries its distance dt(q,et)
  - entries in the queue are sorted with respect to these distances, smallest first
- initialization: insert the root element with distance 0 into PR

While PR is not empty:
- et ← the first element of PR
- if t = 0 (et is an object), report et as the next nearest neighbor
- else insert each child element el of et, with its distance dl(q,el), into PR

Worked trace of NN(q) on the example tree: starting from PR = [(B1,0)], the objects are reported in increasing distance from q: o8 (1), o9 (2), o1 (3), o2 (4), o5 (5), o4 (6), o6 (7), o3 (7), o7 (8). Region B2 is enqueued at distance 0 and region B3 at distance 5, so B3 is expanded only after all closer queue entries have been consumed.
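A minimal sketch of the incremental algorithm using Python's heapq as the priority queue (reusing the Ball class from the earlier sketch; the ball lower bound d(q,p) − r is clamped at zero):

```python
import heapq
from itertools import count

def incremental_nn(root, d, q):
    """Yield database objects in increasing distance from q."""
    tie = count()  # tie-breaker so the heap never compares objects directly
    pr = [(0.0, next(tie), root)]
    while pr:
        dist, _, elem = heapq.heappop(pr)
        if not hasattr(elem, "children"):       # type-0 element: an object
            yield elem, dist                    # the next nearest neighbor
        else:                                   # region: enqueue its children
            for child in elem.children:
                if hasattr(child, "children"):  # child ball: lower bound
                    lb = max(d(q, child.pivot) - child.radius, 0.0)
                    heapq.heappush(pr, (lb, next(tie), child))
                else:                           # child object: exact distance
                    heapq.heappush(pr, (d(q, child), next(tie), child))

# Usage: take the first k yielded results for a k-NN query.
# for o, dist in incremental_nn(tree, d, q): ...
```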
6. Policies to avoid distance computations

Avoiding distance computations
- in metric spaces the distance measure is expensive (e.g., edit distance, quadratic form distance, …)
- limiting the number of distance evaluations speeds up the processing of similarity queries
- pruning strategies: object-pivot, range-pivot, pivot-pivot, double-pivot, pivot filtering

Explanatory example
- an index structure is built over 11 objects {o1,…,o11} using ball partitioning with pivots p1, p2, p3
- range query R(q,r): a sequential scan needs 11 distance computations; the reported objects are {o4, o6}

Object-pivot distance constraint
- usually applied in leaf nodes
- assume the left-most leaf is visited: the distances from q to o4, o6, o10 must be computed
- during insertion, the distances from p2 to o4, o6, o10 were already computed

Object-pivot constraint (cont.)
- having d(p2,o4), d(p2,o6), d(p2,o10) and d(p2,q), some distance calculations can be omitted
- estimating d(q,o10): using only the stored distances we cannot determine the position of o10 – it can lie anywhere on a circle of radius d(p2,o10) around p2
- o10 therefore has two extreme positions:
  - the lower bound on d(q,o10) is |d(p2,o10) − d(q,p2)|, attained when o10 and q lie in the same direction from p2; if it exceeds the query radius, the object cannot qualify (the case of o10)
  - the upper bound on d(q,o10) is d(q,p2) + d(p2,o10), attained when o10 and q lie in opposite directions from p2; if it is below the query radius, the object qualifies directly, with no distance computation (the case of o6)

Object-pivot constraint (summary)
Given a metric space M = (D,d) and three objects q,p,o ∈ D, the distance d(q,o) can be constrained by:
  |d(q,p) − d(p,o)| ≤ d(q,o) ≤ d(q,p) + d(p,o)

Range-pivot distance constraint
- some structures do not store all distances between database objects o_i and a pivot p; only a range [r_l, r_h] of the distances between p and all o_i is kept
- assume the left-most leaf is to be entered: using the range of distances to the leaf objects, we can decide whether to enter it at all

Range-pivot constraint (cont.)
Knowing the interval [r_l, r_h] of distances in the leaf, we can optimize:
- the lower bound on d(q,o) is r_l − d(q,p2); if it exceeds the query radius r, no object in the leaf can qualify
- the upper bound is r_h + d(q,p2); if it is below r, all objects in the leaf qualify
We have considered one position of q; three are possible: below the range, inside it, and above it.

Range-pivot constraint (summary)
Given a metric space M = (D,d) and objects p,o ∈ D such that r_l ≤ d(o,p) ≤ r_h, and q ∈ D with known d(q,p), the distance d(q,o) is restricted by:
  max{ d(q,p) − r_h, r_l − d(q,p), 0 } ≤ d(q,o) ≤ d(q,p) + r_h
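The object-pivot and range-pivot constraints translate directly into bound functions; a minimal sketch:

```python
def object_pivot_bounds(d_qp, d_po):
    """Bounds on d(q,o) from d(q,p) and d(p,o) (triangle inequality)."""
    return abs(d_qp - d_po), d_qp + d_po

def range_pivot_bounds(d_qp, r_low, r_high):
    """Bounds on d(q,o) when only r_low <= d(o,p) <= r_high is known."""
    return max(d_qp - r_high, r_low - d_qp, 0.0), d_qp + r_high

def can_skip(lower, r):       # lower bound above the query radius: prune
    return lower > r

def qualifies(upper, r):      # upper bound within the query radius: report
    return upper <= r         # ... without ever computing d(q,o)
```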
Pivot-pivot distance constraint
- in internal nodes we can do more; assume the root node is examined
- we can apply the range-pivot constraint to decide which sub-trees must be visited; the ranges are known because during the building phase all data objects were compared with p1
- suppose we have followed the left branch (to p2): knowing the distance d(p1,p2) and using d(q,p1), we can apply the object-pivot constraint to get d(q,p2) ∈ [r_l', r_h']
- we also know the range of distances within p2's sub-trees: d(o,p2) ∈ [r_l, r_h]
- if the two ranges intersect, the lower bound on d(q,o) is 0, and the upper bound is r_h + r_h'
- if the ranges do not intersect, there are two possibilities:
  - [r_l, r_h] lies below [r_l', r_h']: the lower bound is r_l' − r_h
  - the inverse case: the lower bound is r_l − r_h'

Pivot-pivot constraint (summary)
Given a metric space M = (D,d) and objects q,p,o ∈ D such that r_l ≤ d(o,p) ≤ r_h and r_l' ≤ d(q,p) ≤ r_h', the distance d(q,o) can be restricted by:
  max{ r_l' − r_h, r_l − r_h', 0 } ≤ d(q,o) ≤ r_h + r_h'

Double-pivot distance constraint
- the previous constraints use just one pivot together with ball partitioning
- applying generalized hyper-plane partitioning, we have two pivots; no upper bound on d(q,o) can be defined!
- if q and o are in different subspaces, the lower bound is (d(q,p1) − d(q,p2)) / 2; a hyperbola marks the positions of q with a constant lower bound
- moving q up, so that its "visual" distance from the equidistant line is preserved, decreases the lower bound
- if q and o are in the same subspace, the lower bound is zero

Double-pivot constraint (summary)
Given a metric space M = (D,d) and objects o,p1,p2 ∈ D such that d(o,p1) ≤ d(o,p2), and a query object q ∈ D with known d(q,p1) and d(q,p2), the distance d(q,o) can be lower-bounded by:
  max{ (d(q,p1) − d(q,p2)) / 2, 0 } ≤ d(q,o)

Pivot filtering
- an extended object-pivot constraint that uses more pivots and the triangle inequality for pruning
- all distances between the database objects and a pivot p are known
- prune an object o ∈ X if either of the following holds:
  - d(p,o) < d(p,q) − r
  - d(p,o) > d(p,q) + r
- a generalized concept of the object-pivot constraint

Pivot filtering (cont.)
- filtering with two pivots: only the objects in the intersection of the two pivots' annuli have to be checked
- effectiveness improves with more pivots

Pivot filtering (summary)
Given a metric space M = (D,d) and a set of pivots P = { p1, p2, p3,…, p_n }, define the mapping Ψ: (D,d) → (ℝⁿ, L∞) as follows:
  Ψ(o) = ( d(o,p1), …, d(o,p_n) )
Then we can bound the distance d(q,o) from below:
  L∞(Ψ(o), Ψ(q)) ≤ d(q,o)

Pivot filtering (consideration)
- given a range query R(q,r), we want to report all objects o such that d(q,o) ≤ r
- applying pivot filtering, we can discard objects for which L∞(Ψ(o), Ψ(q)) > r holds, i.e., whose lower bound on d(q,o) is greater than r
- the mapping Ψ is contractive:
  - no eliminated object can qualify
  - but not every object passing the filter is necessarily relevant
- the surviving objects have to be checked against the original function d(·)
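A minimal sketch of pivot filtering for a range query, with Ψ precomputed for every database object (illustration only):

```python
def psi(o, pivots, d):
    """Map an object to its vector of distances to the pivots."""
    return tuple(d(o, p) for p in pivots)

def range_query_filtered(X, d, q, r, pivots, psi_table):
    """Range query with pivot filtering; psi_table[o] = psi(o, pivots, d)."""
    psi_q = psi(q, pivots, d)
    result, computed = [], 0
    for o in X:
        lower = max(abs(a - b) for a, b in zip(psi_table[o], psi_q))
        if lower > r:
            continue                 # safely pruned, d(q,o) never computed
        computed += 1                # contractive filter: verify survivors
        if d(q, o) <= r:
            result.append(o)
    return result, computed          # computed = distance evaluations done

# 1-D toy example with two pivots:
d = lambda a, b: abs(a - b)
X = [1, 4, 7, 10]; pivots = [0, 12]
table = {o: psi(o, pivots, d) for o in X}
print(range_query_filtered(X, d, 5, 1, pivots, table))  # ([4], 1)
```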
Constraints & explanatory example
Range query R(q,r) = {o4, o6, o8} on the example tree; the first number of distance computations is always for navigation (comparing q with the pivots):
1. Sequential scan: 11 distance computations. With no constraint applied, the tree needs 3+8 computations.
2. With only the object-pivot constraint in the leaves: 3+2 computations; o6 is included without computing d(q,o6), and o10, o2, o9, o3, o7 are eliminated without computation.
3. With only the range-pivot constraint: 3+6 computations; o2 and o9 are pruned. Adding the pivot-pivot constraint changes nothing here: still 3+6 computations.
4. Assume each object knows its distances to the pivots along the path to the root. With only pivot filtering: 3+3 computations (to o4, o6, o8). With all constraints together: 3+2 computations (to o4, o8).

7. Metric space transformations

Metric space transformation
- change one metric space into another:
  - transformation of the original objects
  - changing the metric function
  - transforming both the function and the objects
- metric space embedding: a cheaper distance function
- user-defined search functions

Given M1 = (D1,d1) and M2 = (D2,d2), a transformation is a function f: D1 → D2. The transformed distances need not be equal: in general, d1(x,y) ≠ d2(f(x),f(y)).

Lower-bounding metric functions
- bounds on transformations are exploitable by index structures
- having functions d1,d2: D × D → ℝ, d1 is a lower-bounding distance function of d2 if
  ∀x,y ∈ D: d1(x,y) ≤ d2(x,y)

Lower-bounding functions (cont.)
Scaling factor:
- some metric functions cannot be lower-bounded directly, but can be after being reduced by a factor s:
  ∀x,y ∈ D: s · d1(x,y) ≤ d2(x,y), with 0 < s < 1
- s·d1 is then a lower-bounding function of d2
- the maximum of all possible values of s is called the optimal scaling factor
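An illustrative sketch (mine, not from the book): the optimal scaling factor can be approximated from above by sampling pairs and taking the smallest observed ratio d2/d1:

```python
def empirical_scaling_factor(pairs, d1, d2):
    """Largest s with s*d1(x,y) <= d2(x,y) on the sampled pairs:
    s = min d2/d1 over pairs with d1 > 0. This is only an optimistic
    estimate, since unseen pairs may force a smaller s."""
    return min(d2(x, y) / d1(x, y) for x, y in pairs if d1(x, y) > 0)

# Example: scaling L1 down so that it lower-bounds L2.
l1 = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
l2 = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
pairs = [((0, 0), (1, 1)), ((0, 0), (3, 4)), ((1, 2), (2, 2))]
print(empirical_scaling_factor(pairs, l1, l2))  # ~0.707 = 1/sqrt(2) in 2-D
```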
Example of lower-bounding functions
L_p metrics:
- any L_p' metric lower-bounds an L_p metric if p ≤ p'
- e.g., for two vectors in a 2-D space, L_1 is never smaller than L_2:
  L_2(x,y) = sqrt( (x1−y1)² + (x2−y2)² ) ≤ |x1−y1| + |x2−y2| = L_1(x,y)

Quadratic form distance function:
- bounded from below by a scaled L_2 norm
- the optimal scaling factor is sqrt( min_i {λ_i} ), where the λ_i denote the eigenvalues of the quadratic form function matrix

User-defined metric functions
- different users have different preferences: some prefer a car's speed, others lower prices, etc.
- preferences might be complex (color histograms, data-mining systems) and can be learnt automatically from the previous behavior of a user

User-defined metric functions (cont.)
- preferences are expressed as another distance function d_u, which can differ from user to user; for example, per-user matrices for quadratic form distance functions
- the database is indexed with a fixed metric d_b
- a lower-bounding metric function d_p, which lower-bounds both d_b and d_u, is applied during the search and can exploit the properties of the index structure

Searching using d_p:
- search the index, but use d_p instead of d_b
- this is possible because d_p(x,y) ≤ d_b(x,y): every object that would match a similarity query using d_b will certainly match with d_p
- false-positives in the result are filtered afterwards using d_u, which is possible because d_p(x,y) ≤ d_u(x,y)

Embedding the metric space
- transform the metric space so that a cheaper metric function d2 approximates the original d1 distances:
  d2(f(x), f(y)) ≤ d1(x,y)
- drawbacks: objects must be transformed using the function f, and false-positives must be pruned using the original metric function

Embedding examples: Lipschitz embedding
- a mapping to an n-dimensional vector space whose coordinates correspond to chosen subsets S_i of objects
- an object is mapped to the vector of its distances to the closest object of each coordinate set S_i:
  f(o) = ( d(o,S_1), …, d(o,S_n) ), where d(o,S_i) = min_{x ∈ S_i} d(o,x)
- the transformation is very expensive; the SparseMap extension reduces this cost

Embedding examples: Karhunen-Loeve transformation
- a linear transformation of vector spaces and a dimensionality reduction technique, similar to Principal Component Analysis
- projects an object o onto the first k < n basis vectors
- the transformation is contractive
- used in the FastMap technique
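A minimal sketch of the Lipschitz-style embedding described above (illustration; real implementations such as SparseMap add cost-reduction heuristics). With singleton reference sets it reduces to the pivot mapping Ψ seen earlier:

```python
def lipschitz_embed(o, subsets, d):
    """Map o to its distances to the nearest member of each reference set S_i."""
    return tuple(min(d(o, x) for x in S) for S in subsets)

def embedded_lower_bound(fo, fq):
    """L_inf in the embedded space lower-bounds d(o,q), because
    |d(o,S_i) - d(q,S_i)| <= d(o,q) holds coordinate-wise."""
    return max(abs(a - b) for a, b in zip(fo, fq))

# 1-D toy example with two reference sets:
d = lambda a, b: abs(a - b)
subsets = [{0, 1}, {9, 10}]
print(lipschitz_embed(5, subsets, d))                        # (4, 4)
print(embedded_lower_bound(lipschitz_embed(5, subsets, d),
                           lipschitz_embed(8, subsets, d)))  # 3 <= d(5,8) = 3
```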
8. Principles of approximate similarity search

Principles of approximate similarity search
- approximate similarity search overcomes problems of exact similarity search when using traditional access methods:
  - only a moderate improvement of performance with respect to the sequential scan
  - the dimensionality curse
- exact similarity search returns mathematically precise result sets; however, similarity is often subjective, so in some cases approximate result sets also satisfy the user's needs

Principles of approximate similarity search (cont.)
- approximate similarity search processes a query faster at the price of imprecision in the returned result sets
- useful, for instance, in interactive systems:
  - similarity search is typically an iterative process in which users submit several queries before being satisfied
  - fast approximate answers to the intermediate queries can be useful
- improvements of up to two orders of magnitude are possible

Approximate similarity search: basic strategies
Space transformation:
- distance-preserving transformations, where distances in the transformed space are smaller than in the original space; possible false hits
- examples: dimensionality reduction techniques such as KLT, DFT, DCT, DWT; VA-files
- we will not discuss this approximation strategy in detail

Reducing the subset of data to be examined:
- unpromising data is not accessed
- false dismissals can occur
- this strategy is discussed more deeply in the following slides

Reducing the volume of examined data – possible strategies:
- early termination strategies: a search algorithm may stop before all the needed data has been accessed
- relaxed branching strategies: data regions overlapping the query region can be discarded depending on a specific relaxed pruning strategy

Early termination strategies
- exact similarity search algorithms are iterative processes in which the current result set is improved at each step; they stop when no further improvement is possible
- approximate algorithms use a "relaxed" stop condition that halts the algorithm when there is little chance of improving the current result
- the hypothesis: a good approximation is obtained after a few iterations, while the remaining steps would consume most of the total search cost and improve the result set only marginally

Relaxed branching strategies
- exact similarity search algorithms access all data regions overlapping the query region and discard all the others
- approximate algorithms use a "relaxed" pruning condition that also rejects overlapping regions when there is a low likelihood that data objects fall in the intersection
- particularly useful and effective with access methods based on hierarchical decomposition of the space
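A minimal sketch of an early-termination policy grafted onto the incremental search from earlier (my illustration: stopping after expanding at most max_regions regions is a crude stand-in for the probabilistic stop conditions used in practice):

```python
import heapq
from itertools import count

def approx_knn(root, d, q, k, max_regions=10):
    """Approximate k-NN: identical to the exact best-first search,
    except that no further regions are expanded after max_regions."""
    tie, pr, result, expanded = count(), [(0.0, 0, root)], [], 0
    while pr and len(result) < k:
        dist, _, elem = heapq.heappop(pr)
        if not hasattr(elem, "children"):
            result.append((elem, dist))        # next (approximate) neighbor
            continue
        if expanded >= max_regions:            # relaxed stop condition:
            continue                           # skip the remaining regions
        expanded += 1
        for child in elem.children:
            if hasattr(child, "children"):
                lb = max(d(q, child.pivot) - child.radius, 0.0)
                heapq.heappush(pr, (lb, next(tie), child))
            else:
                heapq.heappush(pr, (d(q, child), next(tie), child))
    return result
```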
Approximate search: example
- a hypothetical index structure with three ball regions B1, B2, B3

Approximate search: range query
Given a range query R(q,r):
- access B1 and report o1; if early termination stopped now, we would lose objects
- access B2 and report o4, o5; if early termination stopped now, we would lose nothing
- access B3: nothing to report; a relaxed branching strategy may discard this region, and we lose nothing

Approximate search: 2-NN query
Given a 2-NN query (initial radius r = ∞):
- access B1: neighbors o1, o3; if early termination stopped now, we would lose objects
- access B2: neighbors o4, o5; if early termination stopped now, we would lose nothing
- access B3: the neighbors o4, o5 remain unchanged; a relaxed branching strategy may discard this region, and we lose nothing

Measures of performance
- performance assessments of approximate similarity search should consider both the improvement in efficiency and the accuracy of the approximate results
- typically there is a trade-off between the two: a high improvement in efficiency is obtained at the cost of accuracy
- good approximate search algorithms should offer a high improvement in efficiency together with high accuracy

Measures of performance: improvement in efficiency
Improvement in efficiency (IE) is the ratio between the costs of the exact and the approximate execution of a query Q:
  IE = Cost / Cost_A
where Cost and Cost_A denote the number of disk accesses (or, alternatively, the number of distance computations) of the precise and the approximate execution of Q, respectively; Q is a range or k-nearest-neighbors query.

IE = 10 means the approximate execution is 10 times faster. Example: exact execution 6 minutes, approximate execution 36 seconds.

Measures of performance: precision and recall
- widely used in Information Retrieval as a performance assessment
- precision: the ratio between the retrieved qualifying objects and all retrieved objects
- recall: the ratio between the retrieved qualifying objects and all qualifying objects

Accuracy can be quantified with precision (P) and recall (R):
  P = |S ∩ S_A| / |S_A|        R = |S ∩ S_A| / |S|
- S – the qualifying objects, i.e., the objects retrieved by the precise algorithm
- S_A – the actually retrieved objects, i.e., the objects retrieved by the approximate algorithm
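A small sketch computing IE, precision, and recall (the numbers reproduce the 10-NN example discussed next):

```python
def improvement_in_efficiency(cost_exact, cost_approx):
    return cost_exact / cost_approx

def precision_recall(S, SA):
    """S: exact result set; SA: approximate result set."""
    hits = len(set(S) & set(SA))
    return hits / len(SA), hits / len(S)

print(improvement_in_efficiency(360, 36))   # 10.0  (6 min vs 36 s)
S   = set(range(1, 11))                     # exact 10-NN result {1,...,10}
SA1 = set(range(2, 12))                     # object 1 is missing
print(precision_recall(S, SA1))             # (0.9, 0.9)
```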
Precision and recall (cont.)
They are very intuitive, but in our context their interpretation is neither obvious nor reliable:
- for approximate range search we typically have S_A ⊆ S, so precision is always 1 in this case
- results of k-NN(q) always have size k, so precision always equals recall in this case
- every element has the same importance: losing the first object costs the same as losing the 1000th one

Example: suppose a 10-NN(q) with
- S = {1,2,3,4,5,6,7,8,9,10}
- S_A1 = {2,3,4,5,6,7,8,9,10,11} – object 1 is missing
- S_A2 = {1,2,3,4,5,6,7,8,9,11} – object 10 is missing
In both cases P = R = 0.9, yet S_A2 can be considered better than S_A1.

Example: suppose a 1-NN(q) with
- S = {1}
- S_A1 = {2} – just one object was skipped
- S_A2 = {10000} – the first 9,999 objects were skipped
In both cases P = R = 0, yet S_A1 can be considered much better than S_A2.

Measures of performance: relative error on distances
Another way to assess accuracy is the relative error on distances (ED), which compares the distances from the query object to the objects in the approximate and the exact results:
  ED = d(q,o_A) / d(q,o_N) − 1,
where o_A and o_N are the approximate and the actual nearest neighbors, respectively. The generalisation to the j-th nearest neighbor is:
  ED_j = d(q,o_j^A) / d(q,o_j^N) − 1

Relative error on distances (cont.)
ED has a drawback: it does not take the distribution of distances into account.
- Example 1: the difference in distance from the query object to o_N and to o_A is large compared to the range of distances. If the algorithm misses o_N and takes o_A, ED is large even though just one object has been missed.
- Example 2: almost all objects have the same (large) distance from the query object. Choosing the farthest rather than the nearest neighbor would produce a small ED, even though almost all objects have been missed.

Measures of performance: error on position
Accuracy can also be measured as the error on position (EP), i.e., the discrepancy between the ranks of objects in the approximate and the exact results. It is obtained from the Spearman footrule distance (SFD):
  SFD = Σ_{o ∈ X} | S_1(o) − S_2(o) |
- |X| – the dataset's cardinality
- S_i(o) – the position of object o in the ordered list S_i

Error on position (cont.)
SFD computes the correlation of two ordered lists and requires both lists to contain identical elements. For partial lists, the induced footrule distance (IFD) is used:
  IFD = Σ_{o ∈ S_A} | OX(o) − S_A(o) |
- OX – the list of the entire dataset ordered with respect to q
- S_A – the approximate result ordered with respect to q

The position of an object in the approximate result is always smaller than or equal to its position in the exact result, because S_A is a sub-list of OX: S_A(o) ≤ OX(o). A normalisation factor |S_A|·|X| can also be used; the error on position (EP) is then defined as:
  EP = ( Σ_{o ∈ S_A} ( OX(o) − S_A(o) ) ) / ( |S_A| · |X| )
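A minimal sketch of EP (the worked examples on the next slides follow from it):

```python
def error_on_position(SA, OX_rank, n):
    """EP: normalized rank displacement of the approximate result SA.
    SA is ordered by distance to q; OX_rank[o] is o's 1-based rank in the
    ordering of the whole dataset; n is the dataset cardinality."""
    ifd = sum(OX_rank[o] - (i + 1) for i, o in enumerate(SA))
    return ifd / (len(SA) * n)

# 10-NN over a dataset of 10,000 objects whose true ranks equal their labels:
OX_rank = {o: o for o in range(1, 10001)}
print(error_on_position(list(range(2, 12)), OX_rank, 10000))   # 0.0001
SA2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
print(error_on_position(SA2, OX_rank, 10000))                  # 1e-05
```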
Error on position (cont.)
Suppose |X| = 10,000 and consider a 10-NN(q):
- S = {1,2,3,4,5,6,7,8,9,10}
- S_A1 = {2,3,4,5,6,7,8,9,10,11} – object 1 is missing
- S_A2 = {1,2,3,4,5,6,7,8,9,11} – object 10 is missing
As intuition suggests:
- for S_A1: EP = 10 / (10 × 10,000) = 0.0001
- for S_A2: EP = 1 / (10 × 10,000) = 0.00001

Suppose |X| = 10,000 and consider a 1-NN(q):
- S = {1}
- S_A1 = {2} – just one object was skipped
- S_A2 = {10,000} – the first 9,999 objects were skipped
As intuition suggests:
- for S_A1: EP = (2 − 1) / (1 × 10,000) = 0.0001
- for S_A2: EP = (10,000 − 1) / (1 × 10,000) = 0.9999

9. Advanced issues

Statistics on metric datasets
- statistical characteristics of datasets form the basis of performance optimisation in databases
- statistical information is used for cost models and access-structure tuning
- typical statistical information:
  - histograms of frequency values for records in databases
  - the distribution of the data, when the data is represented in a vector space

Statistics on metric datasets (cont.)
- histograms and data distributions cannot be used in generic metric spaces: we can only rely on distances, and no coordinate system is available
- statistics useful for similarity-search techniques in metric spaces:
  - distance density and distance distribution
  - homogeneity of viewpoints
  - proximity of ball regions

Data density vs. distance density
- data density (applicable just in vector spaces) characterizes how the data is placed in the space; object coordinates are needed to obtain positions
- distance density (applicable in generic metric spaces) characterizes the distances among objects; no coordinates are needed, just a distance function

Distance distribution and distance density
The distance distribution with respect to an object p (the viewpoint of p) is
  F_p(x) = Pr{ D_p ≤ x },
where D_p is a random variable corresponding to the distance d(p,o) and o is a random object of the metric space. The distance density from p can be obtained as the derivative of the distribution.

The overall distance distribution (informally) is the probability of distances among objects:
  F(x) = Pr{ d(o1,o2) ≤ x },
where o1 and o2 are random objects of the metric space.
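A minimal sketch estimating a viewpoint F_p and the overall distribution F from samples:

```python
import random

def viewpoint(p, sample, d):
    """Empirical F_p: returns a function x -> Pr{d(p,o) <= x}."""
    dists = sorted(d(p, o) for o in sample)
    return lambda x: sum(dist <= x for dist in dists) / len(dists)

def overall_distribution(sample, d, pairs=1000):
    """Empirical overall F from randomly sampled object pairs."""
    dists = [d(*random.sample(sample, 2)) for _ in range(pairs)]
    return lambda x: sum(dist <= x for dist in dists) / len(dists)

# Example on random 2-D points with the Euclidean distance:
d = lambda a, b: ((a[0]-b[0])**2 + (a[1]-b[1])**2) ** 0.5
X = [(random.random(), random.random()) for _ in range(500)]
F = overall_distribution(X, d)
print(F(0.5))   # fraction of sampled pairs within distance 0.5
```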
Homogeneity of viewpoints
- a viewpoint (the distance distribution from p) generally differs from another viewpoint: distances from different objects are distributed differently
- a viewpoint also differs from the overall distance distribution, which characterizes the entire set of possible distances
- however, the overall distance distribution can be used in place of any viewpoint if the dataset is probabilistically homogeneous, i.e., when the discrepancy between the various viewpoints is small

The index of homogeneity of viewpoints (HV) for a metric space M = (D,d) is
  HV(M) = 1 − avg_{p1,p2} δ(F_p1, F_p2),
where p1 and p2 are random objects and the discrepancy between two viewpoints is the average absolute difference of their distribution functions:
  δ(F_p1, F_p2) = avg_x | F_p1(x) − F_p2(x) |,
where F_pi is the viewpoint of pi. If HV(M) ≈ 1, the overall distance distribution can reliably replace any viewpoint.

Proximity of ball regions
The proximity of two regions is a measure that estimates the number of objects contained in their overlap. It is used in:
- region splitting for partitioning: after splitting one region, the new regions should share as few objects as possible
- disk allocation: performance is enhanced by distributing data over several disks
- approximate search: in a relaxed branching strategy, a region is accessed only if there is a high probability of objects in the intersection

Proximity of ball regions (cont.)
- in Euclidean spaces it is easy to obtain: compute the data distribution and integrate it over the regions' intersection
- in metric spaces, coordinates cannot be used and the data distribution cannot be exploited; the distance density/distribution is the only available statistical information

Proximity of ball regions: partitioning
- queries usually follow the data distribution
- partition the data to avoid overlaps, i.e., accessing both regions: low overlap is preferable to high overlap

Proximity of ball regions: data allocation
- regions sharing many objects should be placed on different disk units (declustering), because there is a high probability of their being accessed together by the same query

Proximity of ball regions: approximate search
- skip regions where there is a low chance of finding objects relevant to the query

Proximity of metric ball regions
Given two ball regions R1 = (p1,r1) and R2 = (p2,r2), proximity is defined as the probability that a random object o appears in their intersection:
  prox(R1,R2) = Pr{ d(p1,o) ≤ r1 ∧ d(p2,o) ≤ r2 }
In real-life datasets the distance distribution does not depend much on specific objects (real datasets have a high index of homogeneity), so we define the overall proximity as a function of the pivot distance z = d(p1,p2) alone:
  prox(z,r1,r2) = Pr{ D1 ≤ r1 ∧ D2 ≤ r2 | Dz = z }
By the triangle inequality, Dz ≤ D1 + D2 for any object o in the intersection.
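Proximity can also be approximated empirically; a minimal Monte-Carlo sketch over a sample of the dataset (my illustration, separate from the analytic approximation discussed next):

```python
def proximity_estimate(sample, d, p1, r1, p2, r2):
    """Fraction of sampled objects lying in the overlap of two ball regions."""
    hits = sum(1 for o in sample if d(p1, o) <= r1 and d(p2, o) <= r2)
    return hits / len(sample)

# 1-D example: balls centered at 3 and 7, radius 3 each; overlap = [4, 6].
d = lambda a, b: abs(a - b)
sample = list(range(11))                          # objects 0..10
print(proximity_estimate(sample, d, 3, 3, 7, 3))  # 3/11 ≈ 0.27
```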
Proximity: computational difficulties
Let D1 = d(p1,o), D2 = d(p2,o), and Dz = d(p1,p2) be random variables. The overall proximity can be evaluated mathematically as
  prox(z,r1,r2) = ∫₀^r1 ∫₀^r2 f_{D1,D2|Dz}(x,y|z) dy dx,
but an analytic formula for the joint conditional density f_{D1,D2|Dz}(x,y|z) is not known for generic metric spaces.

Proximity: computational difficulties (cont.)
Idea: replace the joint conditional density f_{D1,D2|Dz}(x,y|z) with the joint density f_{D1,D2}(x,y). These densities are different, but the joint density is easier to obtain; if the overall density f is used, it can be approximated as the product f_{D1,D2}(x,y) ≈ f(x)·f(y). The original expression can then only be approximated.

Proximity: considerations
- the joint conditional density is zero whenever x, y, and z do not satisfy the triangle inequality – such distances simply cannot exist in a metric space
- the joint density is not restricted in this way; the idea is to obtain the joint conditional density by dragging the probability mass lying beyond the borders of the triangle inequality onto the border

Proximity: approximation
- proximity can be computed in O(n) with high precision, where n is the number of samples used for the integral computation of f(x)
- the distance density and distribution are the only information that needs to be pre-computed and stored

Performance prediction
The distance distribution can be used for performance prediction of similarity-search access methods:
- estimate the number of accessed subsets
- estimate the number of distance computations
- estimate the number of objects retrieved
Suppose the dataset was partitioned into m subsets, each bounded by a ball region R_i = (p_i,r_i), 1 ≤ i ≤ m, with pivot p_i and radius r_i.

Performance prediction: range search
A range query R(q,r_q) accesses the subset bounded by region R_i if the region intersects the query, i.e., if d(q,p_i) ≤ r_i + r_q. The probability that a random region R = (p,r) is accessed is
  Pr{ d(q,p) ≤ r + r_q } = F_q(r + r_q) ≈ F(r + r_q),
where p is the random center of the region, F_q is q's viewpoint, and the overall distribution F can replace it because the dataset is highly homogeneous.

Performance prediction: range search (cont.)
The expected number of accessed subsets is obtained by summing the access probabilities of all subsets:
  subsets(R(q,r_q)) ≈ Σ_{i=1..m} F(r_i + r_q),
provided that we have a data structure to maintain the r_i's.

The expected number of distance computations is obtained by summing the subset sizes n_i weighted by the access probabilities:
  computations(R(q,r_q)) ≈ Σ_{i=1..m} n_i · F(r_i + r_q)
The expected size of the result is given simply as
  |R(q,r_q)| ≈ n · F(r_q),
where n is the cardinality of the entire dataset.
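A small sketch of this cost model, usable with an empirical overall distribution F such as the one estimated earlier:

```python
def expected_accessed_subsets(F, radii, rq):
    """Expected number of regions a range query R(q, rq) accesses."""
    return sum(F(r + rq) for r in radii)

def expected_distance_computations(F, regions, rq):
    """regions: list of (radius, cardinality) pairs."""
    return sum(size * F(r + rq) for r, size in regions)

# Toy closed-form distribution: distances uniform on [0,1], F(x) = min(x,1).
F = lambda x: max(0.0, min(x, 1.0))
print(expected_accessed_subsets(F, [0.2, 0.3, 0.5], 0.1))               # ≈ 1.3
print(expected_distance_computations(F, [(0.2, 100), (0.3, 50)], 0.1))  # ≈ 50.0
```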
Performance prediction: range search (cont.)
- a data structure maintaining the radii and the cardinalities of all bounding regions is needed
- the size of this information can become unacceptable, since it grows linearly with the size of the dataset

The previous formulas can be reliably approximated using the average information on each level of a tree (much more compact):
  subsets(R(q,r_q)) ≈ Σ_{l=1..L} M_l · F(ar_l + r_q),
where M_l is the number of subsets at level l, ar_l is the average covering radius at level l, and L is the total number of levels.

Performance prediction: k-NN search
The optimal algorithm for k-NN(q) would access exactly the regions intersecting R(q,d(q,o_k)), where o_k is the k-th nearest neighbor of q; the cost would equal that of the range query R(q,d(q,o_k)). However, d(q,o_k) is not known in advance, so the distance distribution of o_k is used instead:
  F_{O_k}(x) = Pr{ d(q,o_k) ≤ x } = Σ_{j=k..n} C(n,j) · F(x)^j · (1 − F(x))^{n−j},
i.e., the probability that at least k of the n objects lie within distance x of q. The density f_{O_k} is the derivative of F_{O_k}.

Performance prediction: k-NN search (cont.)
The expected number of accessed subsets is obtained by integrating the cost of a range search weighted by the density of the k-th NN distance:
  subsets(k-NN(q)) ≈ ∫₀^∞ ( Σ_{i=1..m} F(r_i + x) ) · f_{O_k}(x) dx
Similarly, the expected number of distance computations uses the subset sizes as weights:
  computations(k-NN(q)) ≈ ∫₀^∞ ( Σ_{i=1..m} n_i · F(r_i + x) ) · f_{O_k}(x) dx
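A minimal sketch of the k-th NN distance distribution built from the overall distribution F (the binomial form above):

```python
from math import comb

def knn_distance_distribution(F, n, k):
    """F_{O_k}(x): probability that at least k of n objects lie within x."""
    def F_ok(x):
        p = F(x)
        return sum(comb(n, j) * p**j * (1 - p)**(n - j)
                   for j in range(k, n + 1))
    return F_ok

# Distances uniform on [0,1], 100 objects: F_{O_1}(x) = 1 - (1-x)^100,
# so roughly half the queries find a neighbor within distance ~0.0069.
F = lambda x: max(0.0, min(x, 1.0))
F_o1 = knn_distance_distribution(F, 100, 1)
print(F_o1(0.0069))   # ≈ 0.5
```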
Similarity Search: Part I, Chapter 1 156 Tree Quality Measures
- Consider our hypothetical index structure again.
- We can build two different trees over the same dataset.
[Figure: two alternative trees over the objects o1, …, o9.]

Similarity Search: Part I, Chapter 1 157 Tree Quality Measures (cont.)
- The first tree is more compact:
  - occupation of leaf nodes is higher;
  - there is no intersection between covering regions.
[Figure: the first tree over o1, …, o9.]

Similarity Search: Part I, Chapter 1 158 Tree Quality Measures (cont.)
- The second tree is less compact:
  - it may result from the deletion of several objects;
  - occupation of leaf nodes is poor;
  - covering regions intersect;
  - some objects lie in the intersection.
[Figure: the second tree over o1, …, o9.]

Similarity Search: Part I, Chapter 1 159 Tree Quality Measures (cont.)
- The first tree is 'better'!
- We would like to measure the quality of trees.
[Figure: both trees over o1, …, o9 side by side.]

Similarity Search: Part I, Chapter 1 160 Tree Quality Measures: Fat Factor
- This quality measure is based on the overlap of metric regions.
- It is different from the previous concept of overlap estimation:
  - it is more local;
  - it is the number of objects in the overlap divided by the total number of objects in the two regions.

Similarity Search: Part I, Chapter 1 161 Fat Factor (cont.)
- The "goodness" of a tree is strictly related to overlap:
  - good trees have overlaps as small as possible.
- The measure counts the total number of node accesses required to answer exact-match queries for all database objects.
  - If the overlap of regions R1 and R2 contains o, both corresponding nodes are accessed by the query R(o, 0).

Similarity Search: Part I, Chapter 1 162 Absolute Fat Factor: Definition
- Let T be a metric tree of n objects with height h and m ≥ 1 nodes. The absolute fat-factor of T is
  fat(T) = (IC − h·n) / (n · (m − h)),
  where IC is the total number of node accesses during the n exact-match query evaluations; IC ranges from n·h to n·m.

Similarity Search: Part I, Chapter 1 163 Absolute Fat Factor: Example
- An ideal tree needs to access just one node per level:
  IC = n·h → fat(Tideal) = 0.
- The worst tree always accesses all nodes:
  IC = n·m → fat(Tworst) = 1.

Similarity Search: Part I, Chapter 1 164 Absolute Fat Factor: Example
- Two trees organizing 5 objects: n = 5, m = 3, h = 2.
  - One tree: IC = 11, so fat(T) = 0.2.
  - The other: IC = 10, so fat(T) = 0.
[Figure: the two trees over o1, …, o5; lines between objects indicate membership in the same node.]

Similarity Search: Part I, Chapter 1 165 Absolute Fat Factor: Summary
- Consequences of the absolute fat-factor:
  - Only range queries are taken into account;
    - k-NN queries are a special case of range queries.
  - The distribution of exact-match queries follows the distribution of data objects;
    - in general, queries are expected to be issued more often in dense regions.
  - The number of nodes in a tree is not considered;
    - a big tree with a low fat-factor is rated better than a small tree with a slightly higher fat-factor.

Similarity Search: Part I, Chapter 1 166 Relative Fat Factor: Definition
- Penalizes trees with more than the minimum number of nodes.
- Let T be a metric tree with more than one node organizing n objects. The relative fat-factor of T is defined as
  rfat(T) = (IC − hmin·n) / (n · (mmin − hmin)),
  - IC – total number of node accesses;
  - C – capacity of a node, in objects;
  - minimum height: hmin = ⌈log_C n⌉;
  - minimum number of nodes: mmin = Σ_{i=1..hmin} ⌈n / C^i⌉.

Similarity Search: Part I, Chapter 1 167 Relative Fat Factor: Example
- Two trees organizing 9 objects: n = 9, C = 3, hmin = 2, mmin = 4.
- Minimum tree: IC = 18, h = 2, m = 4, giving rfat(T) = 0 and fat(T) = 0.
- Non-optimal tree: IC = 27, h = 3, m = 8, giving rfat(T) = 0.5 and fat(T) = 0.
[Figure: the two trees over o1, …, o9.]

Similarity Search: Part I, Chapter 1 168 Tree Quality Measures: Conclusion
- Absolute fat-factor:
  - 0 ≤ fat(T) ≤ 1;
  - region overlaps on the same level are measured;
  - under-filled nodes are not considered;
  - the question it answers: can this tree be improved?
- Relative fat-factor:
  - rfat(T) ≥ 0;
  - the minimum tree is optimal;
  - both overlaps and occupations are considered;
  - the question it answers: which of these trees is better?
  (Both formulas are sketched in code below.)
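A minimal sketch of the two fat-factor formulas above; the names fat and rfat are illustrative, while IC, n, h, m, and the node capacity C come directly from the definitions, with the slide-167 example used as a sanity check:

```python
from math import ceil, log

def fat(IC, n, h, m):
    """Absolute fat-factor: (IC - h*n) / (n * (m - h)), always in [0, 1]."""
    return (IC - h * n) / (n * (m - h))

def rfat(IC, n, C):
    """Relative fat-factor, using the minimum height h_min and the minimum
    number of nodes m_min achievable for n objects with node capacity C."""
    h_min = ceil(log(n, C))
    m_min = sum(ceil(n / C**i) for i in range(1, h_min + 1))
    return (IC - h_min * n) / (n * (m_min - h_min))

# The slide-167 example: n = 9, C = 3 gives h_min = 2 and m_min = 4; the
# non-optimal tree with IC = 27, h = 3, m = 8 has rfat = 0.5 but fat = 0.
assert rfat(27, 9, 3) == 0.5
assert fat(27, 9, 3, 8) == 0.0
```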
Similarity Search: Part I, Chapter 1 169 Choosing Reference Points
- All but the naïve index structures need pivots (reference objects).
- Pivots are essential for partitioning and search pruning.
- Pivots influence performance:
  - the higher and more narrowly focused the distance density with respect to a pivot,
  - the greater the chance that a query object is located at the most frequent distance from the pivot.

Similarity Search: Part I, Chapter 1 170 Choosing Reference Points (cont.)
- Pivots influence performance. Consider ball partitioning around pivot p with radius dm:
  - the distance dm is the most frequent;
  - a query R(q, rq) enters the inner subset if d(q, p) ≤ dm + rq and the outer subset if d(q, p) ≥ dm − rq;
  - if all other distances are not very different from dm, both conditions usually hold, so both subsets are very likely to be accessed by any query.
[Figure: ball partitioning around pivot p with radius dm.]

Similarity Search: Part I, Chapter 1 171 Choosing Reference Points: Example
- Position of a "good" pivot:
  - unit square with uniform distribution;
  - three candidate positions: midpoint, edge, corner;
  - minimize the boundary length:
    - len(pm) = 2.51,
    - len(pe) = 1.256,
    - len(pc) = 1.252.
- The best choice is at the border of the space.
- The midpoint is the worst alternative.
  - In clustering, by contrast, the midpoint is the best choice.

Similarity Search: Part I, Chapter 1 172 Choosing Reference Points: Example
- The shortest boundary is obtained with a pivot po placed outside the space.
[Figure: partition boundaries for the pivots pm, pe, pc, and po.]

Similarity Search: Part I, Chapter 1 173 Choosing Reference Points: Example
- A different view of a "good" pivot:
  - 20-dimensional Euclidean space;
  - the density with respect to a corner pivot is flatter;
  - the density with respect to a central pivot is sharper and thinner.
[Figure: distance densities for a central and a corner pivot.]

Similarity Search: Part I, Chapter 1 174 Choosing Good Pivots
- Good pivots should be outliers of the space,
  - i.e., objects located far from the others,
  - or objects near the boundary of the space.
- Selecting good pivots is difficult:
  - quadratic or cubic complexities are common;
  - pivots are therefore often chosen at random;
    - even though random selection is trivial and performs no optimization, many implementations use it!

Similarity Search: Part I, Chapter 1 175 Choosing Reference Points: Heuristics
- There is no definition of a 'corner' in general metric spaces.
- A corner object is simply one that is 'far away' from the others.
- Algorithm for finding an outlier:
  1. Choose a random object.
  2. Compute the distances from this object to all others.
  3. Pick the furthest object as the pivot.
- This does not guarantee the best possible pivot:
  - it merely helps to choose a better pivot than a random choice would;
  - it brings a 5-10% performance gain.

Similarity Search: Part I, Chapter 1 176 Choosing More Pivots
- The problem of selecting several pivots is more complicated: the pivots should also be fairly far from each other.
- Algorithm for choosing m pivots (a code sketch follows the next slide):
  1. Choose 3m objects at random from the given set of n objects.
  2. Pick an object; the furthest object from it is the first pivot.
  3. The second pivot is the furthest object from the first pivot.
  4. The third pivot is the furthest object from the previous pivots, i.e., the minimum min(d(p1, p3), d(p2, p3)) is maximized.
  5. Continue in this manner until m pivots are obtained.

Similarity Search: Part I, Chapter 1 177 Choosing More Pivots (cont.)
- This algorithm requires O(3m·m) distance computations.
  - For small values of m, it can be repeated several times over different candidate sets, and the best setting is used.
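A minimal sketch of the farthest-first selection just described, assuming the objects support equality comparison and d is the metric; choose_pivots and the seed parameter are illustrative names:

```python
import random

def choose_pivots(objects, m, d, seed=None):
    """Pick m pivots from 3m random candidates: each new pivot maximizes its
    minimum distance to the pivots chosen so far (slide 176), at a cost of
    O(3m * m) distance computations (slide 177)."""
    rnd = random.Random(seed)
    candidates = rnd.sample(objects, min(3 * m, len(objects)))
    start = rnd.choice(candidates)                          # random starting object
    pivots = [max(candidates, key=lambda o: d(start, o))]   # first pivot: furthest from it
    while len(pivots) < m:
        best = max((o for o in candidates if o not in pivots),
                   key=lambda o: min(d(p, o) for p in pivots))
        pivots.append(best)
    return pivots
```

For small m, the whole procedure can be rerun on several candidate sets and the set scoring best under the efficiency criterion kept, as the following slides describe.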
Similarity Search: Part I, Chapter 1 178 Choosing Pivots: Efficiency Criterion
- An algorithm based on an efficiency criterion:
  - it measures the 'quality' of sets of pivots;
  - it uses the mean distance μD between pairs of objects in D.
- Given two sets of pivots
  - P1 = {p1, p2, …, pt} and
  - P2 = {p'1, p'2, …, p't},
- P1 is better than P2 when μD^P1 > μD^P2.

Similarity Search: Part I, Chapter 1 179 Choosing Pivots: Efficiency Criterion
- Given a set of pivots P = {p1, p2, …, pt}, μD^P is estimated as follows:
  1. Choose l pairs of objects {(o1, o'1), (o2, o'2), …, (ol, o'l)} at random from the database X ⊆ D.
  2. Map all pairs into the feature space of the pivot set P:
     Ψ(oi) = (d(p1, oi), d(p2, oi), …, d(pt, oi)),
     Ψ(o'i) = (d(p1, o'i), d(p2, o'i), …, d(pt, o'i)).
  3. For each pair (oi, o'i), compute their distance in the feature space: di = L∞(Ψ(oi), Ψ(o'i)).
  4. Compute μD^P as the mean of the di:
     μD^P = (1/l) · Σ_{i=1..l} di.
  (This estimate is sketched in code at the end of the section.)

Similarity Search: Part I, Chapter 1 180 Efficiency Criterion: Example
- Take P = {p1, p2} and the mapping Ψ used by μD^P.
[Figure: objects o1, o2, o3 and pivots p1, p2 in the original space with distance d (left), and their images in the feature space with L∞ (right).]

Similarity Search: Part I, Chapter 1 181 Choosing Pivots: Incremental Selection
- Selects further pivots "on demand",
  - based on the efficiency criterion μD^P.
- Algorithm:
  1. Select a sample set of m objects.
  2. P1 = {p1} is selected from the sample so that μD^P1 is maximal.
  3. Select another sample set of m objects.
  4. The second pivot p2 is selected so that μD^P2 is maximal, where P2 = {p1, p2} with p1 fixed.
  5. And so on, until the desired number of pivots is reached.
- Total cost for selecting k pivots: 2·l·m·k distance computations.
  - The next step would need only 2·l·m additional distances if the distances di used for computing μD^P are kept.

Similarity Search: Part I, Chapter 1 182 Choosing Reference Points: Summary
- The current rules of thumb are:
  - good pivots are far away from the other objects in the metric space;
  - good pivots are far away from each other.
- Heuristics sometimes fail:
  - consider a dataset with Jaccard's coefficient;
  - the outlier principle would select a pivot p such that d(p, o) = 1 for every other database object o;
  - such a pivot is useless for partitioning and filtering!
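A hedged sketch of the μD^P estimate from slides 178-181: sample l random pairs, map them through Ψ into the pivot space, and average their L∞ distances there. The helper names sample_pairs and mu_D are illustrative:

```python
import random

def sample_pairs(X, l, seed=None):
    """Draw l random object pairs from the database X (a sequence)."""
    rnd = random.Random(seed)
    return [(rnd.choice(X), rnd.choice(X)) for _ in range(l)]

def mu_D(pivots, pairs, d):
    """Estimate mu_D^P as the mean L_inf distance of the pairs in the
    feature space Psi(o) = (d(p1, o), ..., d(pt, o))."""
    def psi(o):
        return [d(p, o) for p in pivots]
    dists = [max(abs(a - b) for a, b in zip(psi(o), psi(o2))) for o, o2 in pairs]
    return sum(dists) / len(dists)

# Pivot set P1 is preferred over P2 whenever mu_D(P1, pairs, d) > mu_D(P2, pairs, d);
# the incremental selection of slide 181 reuses the same pairs at every step.
```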