Optimizing Query Performance in Metric Spaces Matej Antol Content  Background and motivation - indexing metric spaces and query evaluation  Motivation for query evaluation optimization  Approach #1 – Inverted Cache Index (ICI)  Approach #2 – Hybrid strategies for priority queue creation  Conclusions and future work 2 Indexing in metric spaces  Indexes based on objects’ mutual distances  No coordinate system can be used to split data space  Typically data-driven partitioning/clustering  M-tree  Clusters objects bottom up (like B-tree / R-tree)  M-index  Partitions space top down (recursive Voronoi partitioning) 3 M-tree example o7 o1 o6 o10 o3 o2 o5 o4 o9 o8 o11 o1 4.5 -.- o2 6.9 -.o1 1.4 0.0 o10 1.2 3.3 o7 1.3 3.8 o2 2.9 0.0 o4 1.6 5.3 o2 0.0 o8 2.9o1 0.0 o6 1.4 o10 0.0 o3 1.2 o7 0.0 o5 1.3 o11 1.0 o4 0.0 o9 1.6 Covering radius Distance to parent Distance to parent 4 NN Search Algorithm 3-NN(q):  Process B1  Process B2  Process B4  Process B5  Skip B3  PQ is empty, quit. B2 o1 o3 o2 o4 o6 o7 o8 o9 B1 Response: O8, _ , _ Processing: o5 o2 o3 o6 o7 B5 B3 B1 Final result B4B2 o8 o9 o4 o1 B2 B4 B5 B3 B3 B4 _ , _ , _O8, O9 , _O8, O9 , O4O8, O9 , O1 B4B5B3done o5 B1 B5 B2 5 q Motivation: Querying performance  CoPhIR dataset  1 million images, 5 MPEG7 features per image, one weighted distance function  Querying using 1000 selected queries – 30NN query  1 query enters around 1000 leaf nodes (in case when total number of l.n. is 1124)  Avg. number of leaf nodes containing answer objects is ~17  Avg. position of last positive leaf node is ~230  First positive leaf node is typically within first 5 visited leaf nodes 6 App. #1 - Inverted Cache Index (ICI)  Remembering number of times an object/node was part of an answer to prior queries.  ICI value of any node equals sum of ICI values of all its children o1 1.4 0.0 o10 0.0 o3 1.2 Covering radius o10 1.2 3.3 o1 0.0 o6 1.4 Distance to parent ICI 247 1 6 19 5 7 Inverted Cache Index (ICI)  Use ICI with the distance between  query and node  query and object  ICI can be depicted as a “mass” of the object/node  So creating “attraction force” that pulls the query closer  Priority queue in kNN algorithm is ordered by this modified “distance” original distanceo q o q’ q ICI = x o q’’ q ICI = 2x q’ 𝑑𝑖𝑠𝑡𝐼𝐶𝐼 𝑑𝑖𝑠𝑡𝐼𝐶𝐼 8 Naïve ICI: Mass Distance (qd) Formula of Naïve ICI: 𝑚𝑎𝑠𝑠 = log 𝑏 𝐼𝐶𝐼 + 𝑏 𝑑𝑖𝑠𝑡𝐼𝐶𝐼 = Τ𝑑 𝑚𝑎𝑠𝑠 Where: b = selected log base d = distance in original metric space ICI = 20, b = 10 9 Extended ICI: Gravity Distance (qgd) Formula of Extended ICI: 𝑚𝑎𝑠𝑠 = log 𝑏 𝐼𝐶𝐼 + 𝑏 𝑚𝑎𝑠𝑠 𝑔𝑟𝑎𝑣𝑖𝑡𝑦 = Τ𝑚𝑎𝑠𝑠 Τ𝑑 𝑚𝑎𝑥𝑑𝑖𝑠𝑡 𝑝 + 1 𝑑𝑖𝑠𝑡𝐼𝐶𝐼 = Τ𝑑 𝑚𝑎𝑠𝑠 𝑔𝑟𝑎𝑣𝑖𝑡𝑦 Where: d = distance in original metric space maxdist = maximum distance b = base of logarithm, mass growth p = power of normalized distance (how strong gravitation force is) 10 ICI = 20, b = 10, p = 2, maxdist = 10 Experiment Protocol  CoPhIR dataset  1 million images, 5 MPEG7 features per image, one weighted distance function  M-tree and M-index structures  varying leaf node capacity  Queries  1000 most repeated queries w.r.t. Google Analytics on Mufin Demo App  Experiments 1. Proof of concept – M-tree, l.n. cap = 200 comparison of original results with naïve and various forms of extended ICI 2. Results on M-tree, l.n. cap. = 2000, different learning and testing datasets 3. Results on M-index, l.n. cap. = 2000, different learning and testing datasets  Comparison measure  % of queries completed for 30-NN 11 Proof of concept – M-tree 12 %ofqueriescompleted Leaf node capacity 200 Total leaf nodes 11 571 Different distance alteration approaches Results on M-tree 13 %ofqueriescompleted Leaf node capacity 2000 Learning on 1 year traffic (2009) Testing on consequent 1 month traffic (1/2010) 14 Results on M-index%ofqueriescompleted Leaf node capacity 2000 Learning on 1 year traffic (2009) Testing on consequent 1 month traffic (1/2010) Overall improvement Indexing structure log-pwr Orig. visited l.n. 95% q. completed ICI visited l.n. 95% q. completed Improvement M-tree 200 5-5 4600 4200 8,7 % M-tree 2000 10-2 590 470 20,5 % M-index 200 10-5 8000 6000 25 % M-index 2000 10-5 1500 950 37 % 15 App. #2 - Hybrid strategies for priority queues 16  Concluded from deeper analysis of queries in metric spaces  Simple tool for processing and visualization of the data  Distances  Nodes radii  Number of objects within leaf nodes  ICI values  Distances according to different queries  lower bound  Precise  upper bound Priority queues – current state 17 Three basic strategies are being used to create priority queues: lower bound, upper bound and precise LOWER BOUND p1, p3, p2; prunes p4 visits small and further buckets later (p2) UPPER BOUND p1, p2, p4, p3 visits larger buckets later (p3) PRECISE p1, p2, p3; prunes p4 does not take “density” into account at all q (5NN) p3 p1 p2 p4 Priority queues – current state 18  Current strategies are discrete – probably because of their intuitive representation  Better priority queues can be constructed depending on density, size, no. of objects, etc… q p LB precise UB X? LB = dist – rad precise = dist UB = dist + rad X = dist – 1/3 rad ? … Priority queues performance analysis 19  Best performing method (on our dataset) is lower bound  We compared different parameters of the structure and queries (distances, nodes radii, number of objects, ICI values, distances to queries)  Preliminary results show that suitable compound of 2 (or more) strategies can lead to better results 20 21 22 Suggested method 23  We are trying to determine the best way to combine lower bound and precise strategy  The simplest approach is left untested (so far) // X = dist – 1/3 rad  Suggested method follows the data projection on average linear regression of coordinates of positive buckets precise lb (0,0) order = max (k*(dst-rad) + q, dst) dst = |p,q| rad = size(p) empirical (average) values: k=0,45 q=1900 // k*(dst-rad) + q denotes parametric form of linear regression (y = k*x + q) Preliminary performance 24  Total number of buckets in our structure is 11 571  Last positive bucket has average position in priority queue 1 775  Proof of concept setup in simulation has average position of last bucket around 1 343 (24% improvement)  Proposed solution could outperform currently used strategies by tens of per cents  Method is simple, clean and does not require any adjustments in indexing structures Conclusions  Inverted Cache Index  Journal paper  Application to approximate KNN query evaluation  Hybrid strategies for priority queues  First tests on real data  Testing variety of strategies (alternatives to linear regression [max coordinates, …])  Determining the relation between structure parameters and strategy setup (dependence on density, avg dist, avg rad, number of objects, structure depth, …) Future Work  Inverted Cache Index  Shows improvement greater than 25% for a state-of-the-art indexing structures  Successful paper on ADBIS conference; paper was selected to be sent to impact journal  Hybrid strategies for priority queues  New, yet untested method with promising future  25 Thank you for your attention Matej Antol 26