Queries for Similarity Analytics Vlasta Dohnal Nov. 15, 2016 Seminar of DISA Lab 1 Data Analytics • a process of analyzing data and presenting results to users • to make informed decisions • tools and applications (e.g. Data Warehouse) • to collect data • to prepare it for storage and analysis • to develop and run queries • to create reports and dashboards • to visualize data • evolved from decision support systems • business analytics / (advanced) data analytics • prescriptive analytics 2 Outline • Motivation, examples • Similarity operators • Similarity Selection / Join / Set / Group By • Operator evaluation • Consistent and efficient • Extensions to similarity group by • Conclusion • Credits • Silva, Y.N., Aref, W.G., Larson, PA., Pearson S.S., Ali M.H.: Similarity queries: their conceptual evaluation, transformations, and processing. VLDB Journal, vol. 22, pp 395–420, 2013. • Tang, Mingjie, Tahboub, Ruby Y., Aref, Walid G., Atallah, Mikhail J., Malluhi, Qutaibah M., Ouzzani, Mourad, Silva, Yasin N. Similarity Group-By operators for multi-dimensional relational data. ICDE, pp 1448-1449, 2016. • Al Marri, W. J., Malluhi, Q., Ouzzani, M., Tang, M., Aref, W. G. The similarity-aware relational database set operators. Information Systems, vol. 59, pp. 79-93, Elsevier, 2015. Nov. 15, 2016 Seminar of DISA Lab 3 Motivation & Examples • Data processing • Searching that allows some “fuzzy” comparison between data objects • Database systems • Make them similarity-aware • Example queries • Find the closest three suppliers from my location •  k-nearest neighbor query • Find the cheapest gas station within 20km •  range query with order by/limit Nov. 15, 2016 Seminar of DISA Lab 4 Motivation & Examples • More examples • Find the closest three suppliers for every customer within 100 miles from our Chicago headquarters •  range query on customers and kNN-join with suppliers • Considering the customers that are located within 200 miles from our Chicago headquarters, cluster the customers around certain locations of interest, and report the size of each cluster •  range query on customers and group by around some points and compute aggregates • For every customer, identify its closest 3 suppliers and for each such supplier, identify its closest 2 potential new suppliers •  kNN-join with suppliers and kNN-join with pot. suppliers Nov. 15, 2016 Seminar of DISA Lab 5 Similarity Query Language • SQL extended with similarity operators • Implemented in a relational DBMS • Extended query grammar • Extended query optimizer • Requires statistics about similarity operators • Implemented similarity operators • Applied on TPC-H benchmark • E.g. Retrieve customers with similar buying power and account balance Nov. 15, 2016 Seminar of DISA Lab 6 Similarity Operators • Assume • a dataset X, i.e. a set of data objects (o, p, q,..., x, y), • a distance function d(o1,o2), and • descriptors (attributes/properties) of objects: o.A, o.B, … • S. Selection • k-nearest neighbor query (kNN) • range query • combination (kNN(q,r)) • S. Join • -join • kNN-join • self joins • S. Intersection, Union, Difference • S. Group By • Around selected points • Unsupervised grouping •  More operators used within one query Nov. 15, 2016 Seminar of DISA Lab 7 Similarity Selection (SS) • Well-known similarity queries on a dataset 𝜎Θ 𝑆(𝑥.𝐴,𝑐.𝐴) 𝑋 = 𝑥|Θ 𝑆 𝑥. 𝐴, 𝑐. 𝐴 , 𝑥 ∈ 𝑋 • ε-selection • Θ 𝜀=𝒓(𝑥. 𝐴, 𝑞. 𝐴) ≔ (𝑑(𝑥. 𝐴, 𝑞. 𝐴) ≤ 𝒓) • Alt. Θ 𝜀,𝑞.𝐴(𝑥. 𝐴) • kNN selection • Θ 𝑘𝑁𝑁=𝒌 𝑥. 𝐴, 𝑞. 𝐴 ≔ true if 𝑥 ∈ 𝒌𝑁𝑁 𝑞. 𝐴 • Alt. Θ 𝑘𝑁𝑁,𝑞.𝐴(𝑥. 𝐴) • If there are more objects at the distance of kth neighbor, all are reported. Nov. 15, 2016 Seminar of DISA Lab 8 Similarity Join (SJ) • Extends regular join by identifying similar pairs instead of equal ones 𝑋Θ 𝑆(𝑥.𝐴,𝑦.𝐵) 𝑌 = 𝑥, 𝑦 |Θ 𝑆 𝑥. 𝐴, 𝑦. 𝐵 , 𝑥 ∈ 𝑋, 𝑦 ∈ 𝑌 Θ 𝑆 𝑥. 𝐴, 𝑦. 𝐵 - similarity predicate 𝜎Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑋 × 𝑌 • Variants: • Range distance join (ε-join) • k-nearest neighbor join (kNN-join) • k-distance join (kD-join) • Join around (Join-Around) • Wide-join (by Traina group) Nov. 15, 2016 Seminar of DISA Lab 9 Similarity Join (SJ) – Variants • Range distance join (ε-join) • Θ 𝜀 𝑥. 𝐴, 𝑦. 𝐵 ≔ 𝑑(𝑥. 𝐴, 𝑦. 𝐵) ≤ 𝜀 Nov. 15, 2016 Seminar of DISA Lab 10 Similarity Join (SJ) – Variants • k-nearest neighbor join (kNN-join) • Θ 𝑘𝑁𝑁 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑦. 𝐵 ∈ 𝑘𝑁𝑁 𝑥. 𝐴 on 𝑌. 𝐵 • Note for kNN: • If there are more objects at the distance of kth neighbor, all are reported. Nov. 15, 2016 Seminar of DISA Lab 11 Similarity Join (SJ) – Variants • k-distance join (kD-join) • Θ 𝑘𝐷 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑥. 𝐴, 𝑦. 𝐵 ∈ overall 𝑘 closest pairs in 𝑋. 𝐴 × 𝑌. 𝐵 • Note for kD: • If there are more pairs with the kth distance, all are reported. Nov. 15, 2016 Seminar of DISA Lab 12 Similarity Join (SJ) – Variants • Join around (Join-Around) • Θ 𝐴,𝑀𝐷=2𝑟 𝑥. 𝐴, 𝑦. 𝐵 ≡ Θ1𝑁𝑁,2𝑟 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑦. 𝐵 ∈ 1𝑁𝑁 𝑥. 𝐴, 𝑟 on 𝑌. 𝐵 • i.e. 𝑦. 𝐵 is the closest neighbor of 𝑥. 𝐴 and 𝑑(𝑥. 𝐴, 𝑦. 𝐵) ≤ 𝑟 • Note for 1𝑁𝑁, 2𝑟: • If there are more objects at the closest distance, all are reported. Nov. 15, 2016 Seminar of DISA Lab 13 Combining Operators • Using the relational algebra style… • Multiple predicates • Different selection predicates • 𝜎Θ 𝜀 𝑥.𝐴,𝑞1.𝐴 ∧Θ 𝑘𝑁𝑁 𝑥.𝐴,𝑞2.𝐴 𝑋 • Multiple operators • 𝜎Θ 𝜀 𝑥.𝐴,𝑞1.𝐴 𝜎Θ 𝑘𝑁𝑁 𝑥.𝐴,𝑞2.𝐴 𝑋 • Equivalence of operators • Similarity join vs. similarity selection • 𝑋Θ 𝜀 𝑥.𝐴,𝑦.𝐵 𝑌 ≡ 𝜎Θ 𝜀 𝑥.𝐴,𝑦.𝐵 𝑋 × 𝑌 Nov. 15, 2016 Seminar of DISA Lab 14 Combining Operators: Order Matters • Query with C1, C2 q. objects: 𝜎Θ 𝜀,𝐶1 𝑒 ∧Θ 𝑘𝑁𝑁,𝐶2 𝑒 𝐸 Nov. 15, 2016 Seminar of DISA Lab 15 Combining Operators: Conceptual Query Plan • Combine sub-results with intersection  Consistent Evaluation Nov. 15, 2016 Seminar of DISA Lab 16 Optimizing Query Plan • Query plan = a plan of executing individual operations to get query result • Conceptual query plan is not optimal • Same data can be read multiple time • Equivalence rules • Swapping operations in a plan to keep it equivalent to conceptual plan • Type of similarity predicates in operations define their order • kNN type has priority over range! Nov. 15, 2016 Seminar of DISA Lab 17 Selection Predicates • 𝜎Θ 𝑆1,𝐶1 𝑥 ∧Θ 𝑆2,𝐶2 𝑥 𝑋 ≡ 𝜎Θ 𝑆1,𝐶1 𝑥 𝜎Θ 𝑆2,𝐶2 𝑥 𝑋 iff there is an edge S2S1 Nov. 15, 2016 Seminar of DISA Lab 18 Selection Predicates: kNN • It cannot be established, so must be executed independently •  Implement a special “multi-kNN” operator? Nov. 15, 2016 Seminar of DISA Lab 19 Selection and Join: Distributivity a) 𝜎Θ 𝑆 𝒚.𝑩,𝐶 𝑋Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑌 b) 𝜎Θ 𝑆 𝒙.𝑩,𝐶 𝑋Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑌 Nov. 15, 2016 Seminar of DISA Lab 20 Selection and Join: Example Query • Find the closest three suppliers for every customer within 100 miles from our Chicago headquarters (X,Y) •  range query on customers and kNN-join with suppliers 𝜎Θ 𝜀=100,𝐶=(𝑋,𝑌) 𝑐_𝑙𝑜𝑐 𝐶Θ 𝑘𝑁𝑁=3 𝑐_𝑙𝑜𝑐,𝑠_𝑙𝑜𝑐 𝑆 𝜎Θ 𝜀=100,𝐶=(𝑋,𝑌) 𝑐_𝑙𝑜𝑐 𝐶 Θ 𝑘𝑁𝑁=3 𝑐_𝑙𝑜𝑐,𝑠_𝑙𝑜𝑐 𝑆 Nov. 15, 2016 Seminar of DISA Lab 21 SELECT c_custkey, s_suppkey FROM CUSTOMER c, SUPPLIER s WHERE c_loc WITHIN 100 OF (X,Y) AND s_loc 3 TOP_CLOSEST_NEIGHBOR_OF c_loc; Combining Joins • Commutativity • Yes: ε-join, kD-join (and distance function is symmetric) • No: kNN-join, Join-Around • Associativity 𝐸Θ 𝑆 𝑒.𝐴,𝑓.𝐴 𝐹Θ 𝑆 𝑓.𝐵,𝑔.𝐵 𝐺 • Yes: ε-join, kNN-join, Join-Around • No: kD-join • “Commutativity” of unrelated datasets: 𝐸Θ 𝑆 𝑒.𝐴,𝑓.𝐴 𝐺Θ 𝑆 𝑔.𝐴,𝑓.𝐴 𝐹 • Yes: ε-join • No: kNN-join, Join-Around, kD-join Nov. 15, 2016 Seminar of DISA Lab 22 Joins: Example Query • For every customer, identify its closest 3 suppliers and for each such supplier, identify its closest 2 potential new suppliers Nov. 15, 2016 Seminar of DISA Lab 23 SELECT c_custkey, s_suppkey, psu_suppkey FROM CUSTOMER c, SUPPLIER s, POT_SUPPLIER psu WHERE s_loc 3 TOP_CLOSEST_NEIGHBOR_OF c_loc AND psu_loc 2 TOP_CLOSEST_NEIGHBOR_OF s_loc; Selection and Join: Combining ε-predicates • Selection pull-up & push-down: • Used in relational DBMS to further optimize the query • 𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋 Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝑌 ≡ 𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝑌 ≡ 𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋 Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝜎Θ(𝜀1+𝜀2),𝐶(𝑦.𝐴) 𝑌 Nov. 15, 2016 Seminar of DISA Lab 24 Transformation Rules: Example Nov. 15, 2016 Seminar of DISA Lab 25 Transformation Rules: Example Nov. 15, 2016 Seminar of DISA Lab 26 • No Cartesian product is used • Datasets reused due to kNN selection Transformation Rules: Performance • Associativity of -join • Data • AccBalLevel1 • 110 different levels of account balance in [0;11000] • AccBalLelel2 • 11,000 dtto • Customer • 750,000 recs Nov. 15, 2016 Seminar of DISA Lab 27 SELECT * FROM CUSTOMER C, AccBalLevels1 R1, AccBalLevels2 R2 WHERE C acctbal WITHIN 11 OF R1.refpoint AND R1.refpoint WITHIN 11 OF R2.refpoint; 9.2% of LHS Similarity Set Operators • Similarity intersect / union / difference • Implemented in relational DB • Distance functions defined on regular attributes • Identify similar tuples using threshold distance  • Efficient implementation – 100x faster than using regular DB operators Nov. 15, 2016 Seminar of DISA Lab 28 Summary • Data analytics need multiple operators in a query • Consistent query evaluation is important • if no priority requested by the user • Some operator predicates are not commutative (involving kNN) • kNN must be performed as first! • Can kNN be constrained with  using some statistics? • Equivalence rules cannot optimize everything • Special “multi-query” operation over one database needed (to combine kNN) • Similarity group by • Applied as the last operator • Can be split and pushed down – eager aggregation Nov. 15, 2016 Seminar of DISA Lab 29 Similarity Group By • Extended syntax of regular group by 𝐺1,𝑆1 ,… 𝐺 𝑚,𝑆 𝑛 Γ 𝐹1 𝐴1 ,…𝐹 𝑛 𝐴 𝑛 (𝑋) 𝑆1 - segmentation of domain of 𝐺1 into non-overlapping groups 𝐹1 - aggregation on 𝐴1 of data objects in a group • Result is a set of objects with regular attributes/properties • Procedure: • Partition all data objects in the result into groups • Obtain group representatives • Compute aggregates Fi on all objects per group • i.e. each combination of segments (values) of all Gjs Nov. 15, 2016 Seminar of DISA Lab 30 Similarity Group By: Implementation • Task: Define segmentation for Gi • Using clustering • Multiple-pass external algorithms (k-means, hierarchical clustering) • very slow, may not be consistent • Approximation using sampling or summaries possible (BIRCH, CURE) – one-pass • still slow, hard to compute aggregates simultaneously, no pipelining with other operators • Special implementation •  control on constraint specification •  pipelining, aggregations during group-by, no approximations (like sampling) • Implemented in a query evaluation engine Nov. 15, 2016 Seminar of DISA Lab 31 Similarity Group By: Implementation • Variants: • Supervised – segmentation is defined in advance • E.g. cluster centers, dividing thresholds • Unsupervised – segmentation obtained from data • E.g. number of resulting clusters given • Segmentation properties: • Element separation s – each object has another object within distance s • Group diameter d – distance between the most separated objects ≤ d Nov. 15, 2016 Seminar of DISA Lab 32 Similarity Group By Around • Supervised: Around – SGB-A • Set of central points – define groups • objects assigned to the group of the closest central point • Element separation s and group diameter d (=2r) – optional • Some data objects might be excluded! Nov. 15, 2016 Seminar of DISA Lab 33 Similarity Group By Delimited • Supervised: Delimited – SGB-D • Set of delimiting hyperplanes • For nD space: 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎 𝑛 𝑥 𝑛 ≤ 𝑏 and 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎 𝑛 𝑥 𝑛 > 𝑏 • Element separation s and group diameter d – optional Nov. 15, 2016 Seminar of DISA Lab 34 Similarity Group By Unsupervised • Unsupervised – SGB-U • Element separation s – each object has another object within distance s • Group diameter d – distance between the most separated objects ≤ d Nov. 15, 2016 Seminar of DISA Lab 35 Similarity Group By Unsupervised • Unsupervised SGB extended to vector spaces • Element separation s – each object has another object within distance s • Distance function d – L2 or L, but not limited to • SGB-All • SGB-Any Nov. 15, 2016 Seminar of DISA Lab 36 Similarity Group By: Vector Spaces • SGB-All • An object o is in the group if its distance to all objects in the group is ≤ s () • An object may be part of more groups, so the on-overlap functions: • Join-any – a group out of the matching groups is picked at random • Eliminate – the object is eliminated from grouping • New-group – all objects overlapping more groups form a new group • Algorithm 𝒪 𝑛 log 𝐺 for vector spaces Nov. 15, 2016 Seminar of DISA Lab 37 =3, L2 Similarity Group By: Vector Spaces • SGB-Any • An object o is in the group if its distance to at least one object in the group is ≤ s () • All objects can be assigned uniquely • Algorithm 𝒪 𝑛 log 𝑛 for vector spaces Nov. 15, 2016 Seminar of DISA Lab 38 =3, L2 Summary: SGB for 1D-nD spaces • Efficient algorithms • comparable to regular Group-By, faster than clustering • Group representatives • SGB-Around • Given by default • SGB-Delimited • Order on boundaries (1D) • SGB-U/All/Any • Centroid of group • Regular aggregations • On a component of all points in a group (min/max/avg) – 1D • On points (polygon/convex hull) – nD Nov. 15, 2016 Seminar of DISA Lab 39 Similarity Group By: Distance Spaces • Parameters: • Element separation s – each object has another object within distance s • Group diameter 2 – distance between the most separated objects ≤ 2 • Variants: • SGB-Around • Voronoi partitioning using pivots • Element separation s, Group diameter 2 • Some objects might not be assigned to any group • Some objects might belong to more groups • SGB-Delimited • A set of hyperplanes defined by pivots 𝑝1, 𝑝2 ; similar to Voronoi partitioning • Any practical application? Nov. 15, 2016 Seminar of DISA Lab 40 Similarity Group By: Distance Spaces • Variants: • SGB-U • SGB-All  parameters s=, diameter= • SGB-Any  parameters s=, diameter= • SGB-P – permutation • Group is formed by objects having the same k-nearest pivots • Recursive application of SGB-Around • Parameters separation and diameter not applicable • Alternative: …. k-furthest pivots? • SGB-S – subset of pivots • “permutation” without ordering Nov. 15, 2016 Seminar of DISA Lab 41 Similarity Group By: Distance Spaces • SGB conflict handling when parameters (separation, diameter) given • on overlap • ELIMINATE • ASSIGN_TO_ANY - might not be random, due to consistency • ASSIGN_TO_ALL • FORM_NEW_GROUP • SET_NULL • on miss • ELIMINATE • FORM_NEW_GROUP • SET_NULL • Handle missing descriptors  “NULL” group • e.g. when more SGB attributes are combined Nov. 15, 2016 Seminar of DISA Lab 42 Similarity Group By: Distance Spaces • SGB with roll-up/drill-down on parameters • Gradually extend/shrink separation/diameter • Multiple assignment necessary • SGB with roll-up/drill-down/dice on attributes • Group representatives • Cluster medoid • Surrogate key • Result of an aggregate function Nov. 15, 2016 Seminar of DISA Lab 43 Similarity Group By: Distance Spaces • Aggregate functions: • COUNT • MIN • The object closest to the group representative • MAX • Furthest object from group representative • AVG • Artificial object (center)? • MEDOID • SUM • A concatenation of all objects? • SET • A bag of all original objects • COVER • a set of boundary objects • “skyline” objects (or “convex hull”) Nov. 15, 2016 Seminar of DISA Lab 44 Future Work • Combination of multiple predicates • Optimized algorithms • Similarity group by variants defined • Introduce parameter refinement – “drill-down” feature • Incorporate diversity in result • Need for new aggregate functions • Efficient algorithms for metric spaces • Many existing rely on sorting the attribute values • New query types • “Multi-query” kNN • Recursive self-join Nov. 15, 2016 Seminar of DISA Lab 45