Queries for Similarity Analytics
Vlasta Dohnal
Nov. 15, 2016 Seminar of DISA Lab 1
Data Analytics
• a process of analyzing data and presenting results to users
• to make informed decisions
• tools and applications (e.g. Data Warehouse)
• to collect data
• to prepare it for storage and analysis
• to develop and run queries
• to create reports and dashboards
• to visualize data
• evolved from decision support systems
• business analytics / (advanced) data analytics
• prescriptive analytics
2
Outline
• Motivation, examples
• Similarity operators
• Similarity Selection / Join / Set / Group By
• Operator evaluation
• Consistent and efficient
• Extensions to similarity group by
• Conclusion
• Credits
• Silva, Y.N., Aref, W.G., Larson, PA., Pearson S.S., Ali M.H.: Similarity queries: their conceptual evaluation,
transformations, and processing. VLDB Journal, vol. 22, pp 395–420, 2013.
• Tang, Mingjie, Tahboub, Ruby Y., Aref, Walid G., Atallah, Mikhail J., Malluhi, Qutaibah M., Ouzzani, Mourad,
Silva, Yasin N. Similarity Group-By operators for multi-dimensional relational data. ICDE, pp 1448-1449, 2016.
• Al Marri, W. J., Malluhi, Q., Ouzzani, M., Tang, M., Aref, W. G. The similarity-aware relational database set
operators. Information Systems, vol. 59, pp. 79-93, Elsevier, 2015.
Nov. 15, 2016 Seminar of DISA Lab 3
Motivation & Examples
• Data processing
• Searching that allows some “fuzzy” comparison between data objects
• Database systems
• Make them similarity-aware
• Example queries
• Find the closest three suppliers from my location
•  k-nearest neighbor query
• Find the cheapest gas station within 20km
•  range query with order by/limit
Nov. 15, 2016 Seminar of DISA Lab 4
Motivation & Examples
• More examples
• Find the closest three suppliers for every customer within 100 miles from our
Chicago headquarters
•  range query on customers and kNN-join with suppliers
• Considering the customers that are located within 200 miles from our Chicago
headquarters, cluster the customers around certain locations of interest, and
report the size of each cluster
•  range query on customers and group by around some points and compute aggregates
• For every customer, identify its closest 3 suppliers and for each such supplier,
identify its closest 2 potential new suppliers
•  kNN-join with suppliers and kNN-join with pot. suppliers
Nov. 15, 2016 Seminar of DISA Lab 5
Similarity Query Language
• SQL extended with similarity operators
• Implemented in a relational DBMS
• Extended query grammar
• Extended query optimizer
• Requires statistics about similarity operators
• Implemented similarity operators
• Applied on TPC-H benchmark
• E.g. Retrieve customers with similar buying power and account balance
Nov. 15, 2016 Seminar of DISA Lab 6
Similarity Operators
• Assume
• a dataset X, i.e. a set of data objects (o, p, q,..., x, y),
• a distance function d(o1,o2), and
• descriptors (attributes/properties) of objects: o.A, o.B, …
• S. Selection
• k-nearest neighbor query (kNN)
• range query
• combination (kNN(q,r))
• S. Join
• -join
• kNN-join
• self joins
• S. Intersection, Union, Difference
• S. Group By
• Around selected points
• Unsupervised grouping
•  More operators used within one query
Nov. 15, 2016 Seminar of DISA Lab 7
Similarity Selection (SS)
• Well-known similarity queries on a dataset
𝜎Θ 𝑆(𝑥.𝐴,𝑐.𝐴) 𝑋 = 𝑥|Θ 𝑆 𝑥. 𝐴, 𝑐. 𝐴 , 𝑥 ∈ 𝑋
• ε-selection
• Θ 𝜀=𝒓(𝑥. 𝐴, 𝑞. 𝐴) ≔ (𝑑(𝑥. 𝐴, 𝑞. 𝐴) ≤ 𝒓)
• Alt. Θ 𝜀,𝑞.𝐴(𝑥. 𝐴)
• kNN selection
• Θ 𝑘𝑁𝑁=𝒌 𝑥. 𝐴, 𝑞. 𝐴 ≔ true if 𝑥 ∈ 𝒌𝑁𝑁 𝑞. 𝐴
• Alt. Θ 𝑘𝑁𝑁,𝑞.𝐴(𝑥. 𝐴)
• If there are more objects at the distance of kth neighbor, all are reported.
Nov. 15, 2016 Seminar of DISA Lab 8
Similarity Join (SJ)
• Extends regular join by identifying similar pairs instead of equal ones
𝑋Θ 𝑆(𝑥.𝐴,𝑦.𝐵) 𝑌 = 𝑥, 𝑦 |Θ 𝑆 𝑥. 𝐴, 𝑦. 𝐵 , 𝑥 ∈ 𝑋, 𝑦 ∈ 𝑌
Θ 𝑆 𝑥. 𝐴, 𝑦. 𝐵 - similarity predicate
𝜎Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑋 × 𝑌
• Variants:
• Range distance join (ε-join)
• k-nearest neighbor join (kNN-join)
• k-distance join (kD-join)
• Join around (Join-Around)
• Wide-join (by Traina group)
Nov. 15, 2016 Seminar of DISA Lab 9
Similarity Join (SJ) – Variants
• Range distance join (ε-join)
• Θ 𝜀 𝑥. 𝐴, 𝑦. 𝐵 ≔ 𝑑(𝑥. 𝐴, 𝑦. 𝐵) ≤ 𝜀
Nov. 15, 2016 Seminar of DISA Lab 10
Similarity Join (SJ) – Variants
• k-nearest neighbor join (kNN-join)
• Θ 𝑘𝑁𝑁 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑦. 𝐵 ∈ 𝑘𝑁𝑁 𝑥. 𝐴 on 𝑌. 𝐵
• Note for kNN:
• If there are more objects at the distance of kth neighbor, all are reported.
Nov. 15, 2016 Seminar of DISA Lab 11
Similarity Join (SJ) – Variants
• k-distance join (kD-join)
• Θ 𝑘𝐷 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑥. 𝐴, 𝑦. 𝐵 ∈ overall 𝑘 closest pairs in 𝑋. 𝐴 × 𝑌. 𝐵
• Note for kD:
• If there are more pairs with the kth distance, all are reported.
Nov. 15, 2016 Seminar of DISA Lab 12
Similarity Join (SJ) – Variants
• Join around (Join-Around)
• Θ 𝐴,𝑀𝐷=2𝑟 𝑥. 𝐴, 𝑦. 𝐵 ≡ Θ1𝑁𝑁,2𝑟 𝑥. 𝐴, 𝑦. 𝐵 ≔ true if 𝑦. 𝐵 ∈ 1𝑁𝑁 𝑥. 𝐴, 𝑟 on 𝑌. 𝐵
• i.e. 𝑦. 𝐵 is the closest neighbor of 𝑥. 𝐴 and 𝑑(𝑥. 𝐴, 𝑦. 𝐵) ≤ 𝑟
• Note for 1𝑁𝑁, 2𝑟:
• If there are more objects at the closest distance, all are reported.
Nov. 15, 2016 Seminar of DISA Lab 13
Combining Operators
• Using the relational algebra style…
• Multiple predicates
• Different selection predicates
• 𝜎Θ 𝜀 𝑥.𝐴,𝑞1.𝐴 ∧Θ 𝑘𝑁𝑁 𝑥.𝐴,𝑞2.𝐴 𝑋
• Multiple operators
• 𝜎Θ 𝜀 𝑥.𝐴,𝑞1.𝐴 𝜎Θ 𝑘𝑁𝑁 𝑥.𝐴,𝑞2.𝐴 𝑋
• Equivalence of operators
• Similarity join vs. similarity selection
• 𝑋Θ 𝜀 𝑥.𝐴,𝑦.𝐵 𝑌 ≡ 𝜎Θ 𝜀 𝑥.𝐴,𝑦.𝐵 𝑋 × 𝑌
Nov. 15, 2016 Seminar of DISA Lab 14
Combining Operators: Order Matters
• Query with C1, C2 q. objects: 𝜎Θ 𝜀,𝐶1 𝑒 ∧Θ 𝑘𝑁𝑁,𝐶2 𝑒 𝐸
Nov. 15, 2016 Seminar of DISA Lab 15
Combining Operators: Conceptual Query Plan
• Combine sub-results with intersection  Consistent Evaluation
Nov. 15, 2016 Seminar of DISA Lab 16
Optimizing Query Plan
• Query plan = a plan of executing individual operations to get query result
• Conceptual query plan is not optimal
• Same data can be read multiple time
• Equivalence rules
• Swapping operations in a plan to keep it equivalent to conceptual plan
• Type of similarity predicates in operations define their order
• kNN type has priority over range!
Nov. 15, 2016 Seminar of DISA Lab 17
Selection Predicates
• 𝜎Θ 𝑆1,𝐶1 𝑥 ∧Θ 𝑆2,𝐶2 𝑥 𝑋 ≡ 𝜎Θ 𝑆1,𝐶1 𝑥 𝜎Θ 𝑆2,𝐶2 𝑥 𝑋 iff there is an edge S2S1
Nov. 15, 2016 Seminar of DISA Lab 18
Selection Predicates: kNN
• It cannot be established, so must be executed independently
•  Implement a special “multi-kNN” operator?
Nov. 15, 2016 Seminar of DISA Lab 19
Selection and Join: Distributivity
a) 𝜎Θ 𝑆 𝒚.𝑩,𝐶 𝑋Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑌
b) 𝜎Θ 𝑆 𝒙.𝑩,𝐶 𝑋Θ 𝑆 𝑥.𝐴,𝑦.𝐴 𝑌
Nov. 15, 2016 Seminar of DISA Lab 20
Selection and Join: Example Query
• Find the closest three suppliers for every customer within 100 miles
from our Chicago headquarters (X,Y)
•  range query on customers and kNN-join with suppliers
𝜎Θ 𝜀=100,𝐶=(𝑋,𝑌) 𝑐_𝑙𝑜𝑐 𝐶Θ 𝑘𝑁𝑁=3 𝑐_𝑙𝑜𝑐,𝑠_𝑙𝑜𝑐 𝑆
𝜎Θ 𝜀=100,𝐶=(𝑋,𝑌) 𝑐_𝑙𝑜𝑐 𝐶 Θ 𝑘𝑁𝑁=3 𝑐_𝑙𝑜𝑐,𝑠_𝑙𝑜𝑐 𝑆
Nov. 15, 2016 Seminar of DISA Lab 21
SELECT c_custkey, s_suppkey FROM CUSTOMER c, SUPPLIER s
WHERE c_loc WITHIN 100 OF (X,Y) AND s_loc 3 TOP_CLOSEST_NEIGHBOR_OF c_loc;
Combining Joins
• Commutativity
• Yes: ε-join, kD-join (and distance function is symmetric)
• No: kNN-join, Join-Around
• Associativity 𝐸Θ 𝑆 𝑒.𝐴,𝑓.𝐴 𝐹Θ 𝑆 𝑓.𝐵,𝑔.𝐵 𝐺
• Yes: ε-join, kNN-join, Join-Around
• No: kD-join
• “Commutativity” of unrelated datasets: 𝐸Θ 𝑆 𝑒.𝐴,𝑓.𝐴 𝐺Θ 𝑆 𝑔.𝐴,𝑓.𝐴 𝐹
• Yes: ε-join
• No: kNN-join, Join-Around, kD-join
Nov. 15, 2016 Seminar of DISA Lab 22
Joins: Example Query
• For every customer, identify its closest 3 suppliers and for each such
supplier, identify its closest 2 potential new suppliers
Nov. 15, 2016 Seminar of DISA Lab 23
SELECT c_custkey, s_suppkey, psu_suppkey FROM CUSTOMER c, SUPPLIER s, POT_SUPPLIER psu
WHERE s_loc 3 TOP_CLOSEST_NEIGHBOR_OF c_loc
AND psu_loc 2 TOP_CLOSEST_NEIGHBOR_OF s_loc;
Selection and Join: Combining ε-predicates
• Selection pull-up & push-down:
• Used in relational DBMS to further optimize the query
• 𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋 Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝑌 ≡
𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝑌 ≡
𝜎Θ 𝜀1,𝐶(𝑥.𝐴) 𝑋 Θ 𝜀2(𝑥.𝐴,𝑦.𝐴) 𝜎Θ(𝜀1+𝜀2),𝐶(𝑦.𝐴) 𝑌
Nov. 15, 2016 Seminar of DISA Lab 24
Transformation Rules: Example
Nov. 15, 2016 Seminar of DISA Lab 25
Transformation Rules: Example
Nov. 15, 2016 Seminar of DISA Lab 26
• No Cartesian product
is used
• Datasets reused
due to kNN
selection
Transformation Rules: Performance
• Associativity of -join
• Data
• AccBalLevel1
• 110 different levels of account
balance in [0;11000]
• AccBalLelel2
• 11,000 dtto
• Customer
• 750,000 recs
Nov. 15, 2016 Seminar of DISA Lab 27
SELECT *
FROM CUSTOMER C, AccBalLevels1 R1, AccBalLevels2 R2
WHERE C acctbal WITHIN 11 OF R1.refpoint
AND R1.refpoint WITHIN 11 OF R2.refpoint;
9.2% of LHS
Similarity Set Operators
• Similarity intersect / union / difference
• Implemented in relational DB
• Distance functions defined on regular attributes
• Identify similar tuples using threshold distance 
• Efficient implementation – 100x faster than using regular DB operators
Nov. 15, 2016 Seminar of DISA Lab 28
Summary
• Data analytics need multiple operators in a query
• Consistent query evaluation is important
• if no priority requested by the user
• Some operator predicates are not commutative (involving kNN)
• kNN must be performed as first!
• Can kNN be constrained with  using some statistics?
• Equivalence rules cannot optimize everything
• Special “multi-query” operation over one database needed (to combine kNN)
• Similarity group by
• Applied as the last operator
• Can be split and pushed down – eager aggregation
Nov. 15, 2016 Seminar of DISA Lab 29
Similarity Group By
• Extended syntax of regular group by
𝐺1,𝑆1 ,… 𝐺 𝑚,𝑆 𝑛
Γ 𝐹1 𝐴1 ,…𝐹 𝑛 𝐴 𝑛
(𝑋)
𝑆1 - segmentation of domain of 𝐺1 into non-overlapping groups
𝐹1 - aggregation on 𝐴1 of data objects in a group
• Result is a set of objects with regular attributes/properties
• Procedure:
• Partition all data objects in the result into groups
• Obtain group representatives
• Compute aggregates Fi on all objects per group
• i.e. each combination of segments (values) of all Gjs
Nov. 15, 2016 Seminar of DISA Lab 30
Similarity Group By: Implementation
• Task: Define segmentation for Gi
• Using clustering
• Multiple-pass external algorithms (k-means, hierarchical clustering)
• very slow, may not be consistent
• Approximation using sampling or summaries possible (BIRCH, CURE) – one-pass
• still slow, hard to compute aggregates simultaneously, no pipelining with other operators
• Special implementation
•  control on constraint specification
•  pipelining, aggregations during group-by, no approximations (like sampling)
• Implemented in a query evaluation engine
Nov. 15, 2016 Seminar of DISA Lab 31
Similarity Group By: Implementation
• Variants:
• Supervised – segmentation is defined in advance
• E.g. cluster centers, dividing thresholds
• Unsupervised – segmentation obtained from data
• E.g. number of resulting clusters given
• Segmentation properties:
• Element separation s – each object has another object within distance s
• Group diameter d – distance between the most separated objects ≤ d
Nov. 15, 2016 Seminar of DISA Lab 32
Similarity Group By Around
• Supervised: Around – SGB-A
• Set of central points – define groups
• objects assigned to the group of the closest central point
• Element separation s and group diameter d (=2r) – optional
• Some data objects might be excluded!
Nov. 15, 2016 Seminar of DISA Lab 33
Similarity Group By Delimited
• Supervised: Delimited – SGB-D
• Set of delimiting hyperplanes
• For nD space: 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎 𝑛 𝑥 𝑛 ≤ 𝑏 and 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎 𝑛 𝑥 𝑛 > 𝑏
• Element separation s and group diameter d – optional
Nov. 15, 2016 Seminar of DISA Lab 34
Similarity Group By Unsupervised
• Unsupervised – SGB-U
• Element separation s – each object has another object within distance s
• Group diameter d – distance between the most separated objects ≤ d
Nov. 15, 2016 Seminar of DISA Lab 35
Similarity Group By Unsupervised
• Unsupervised SGB extended to vector spaces
• Element separation s – each object has another object within distance s
• Distance function d – L2 or L, but not limited to
• SGB-All
• SGB-Any
Nov. 15, 2016 Seminar of DISA Lab 36
Similarity Group By: Vector Spaces
• SGB-All
• An object o is in the group if its distance
to all objects in the group is ≤ s ()
• An object may be part of more groups, so
the on-overlap functions:
• Join-any – a group out of the matching
groups is picked at random
• Eliminate – the object is eliminated from
grouping
• New-group – all objects overlapping more
groups form a new group
• Algorithm 𝒪 𝑛 log 𝐺 for vector spaces
Nov. 15, 2016 Seminar of DISA Lab 37
=3, L2
Similarity Group By: Vector Spaces
• SGB-Any
• An object o is in the group if its distance
to at least one object in the group is ≤ s
()
• All objects can be assigned uniquely
• Algorithm 𝒪 𝑛 log 𝑛 for vector spaces
Nov. 15, 2016 Seminar of DISA Lab 38
=3, L2
Summary: SGB for 1D-nD spaces
• Efficient algorithms
• comparable to regular Group-By, faster than clustering
• Group representatives
• SGB-Around
• Given by default
• SGB-Delimited
• Order on boundaries (1D)
• SGB-U/All/Any
• Centroid of group
• Regular aggregations
• On a component of all points in a group (min/max/avg) – 1D
• On points (polygon/convex hull) – nD
Nov. 15, 2016 Seminar of DISA Lab 39
Similarity Group By: Distance Spaces
• Parameters:
• Element separation s – each object has another object within distance s
• Group diameter 2 – distance between the most separated objects ≤ 2
• Variants:
• SGB-Around
• Voronoi partitioning using pivots
• Element separation s, Group diameter 2
• Some objects might not be assigned to any group
• Some objects might belong to more groups
• SGB-Delimited
• A set of hyperplanes defined by pivots 𝑝1, 𝑝2 ; similar to Voronoi partitioning
• Any practical application?
Nov. 15, 2016 Seminar of DISA Lab 40
Similarity Group By: Distance Spaces
• Variants:
• SGB-U
• SGB-All  parameters s=, diameter=
• SGB-Any  parameters s=, diameter=
• SGB-P – permutation
• Group is formed by objects having the same k-nearest pivots
• Recursive application of SGB-Around
• Parameters separation and diameter not applicable
• Alternative: …. k-furthest pivots?
• SGB-S – subset of pivots
• “permutation” without ordering
Nov. 15, 2016 Seminar of DISA Lab 41
Similarity Group By: Distance Spaces
• SGB conflict handling when parameters (separation, diameter) given
• on overlap
• ELIMINATE
• ASSIGN_TO_ANY - might not be random, due to consistency
• ASSIGN_TO_ALL
• FORM_NEW_GROUP
• SET_NULL
• on miss
• ELIMINATE
• FORM_NEW_GROUP
• SET_NULL
• Handle missing descriptors  “NULL” group
• e.g. when more SGB attributes are combined
Nov. 15, 2016 Seminar of DISA Lab 42
Similarity Group By: Distance Spaces
• SGB with roll-up/drill-down on parameters
• Gradually extend/shrink separation/diameter
• Multiple assignment necessary
• SGB with roll-up/drill-down/dice on attributes
• Group representatives
• Cluster medoid
• Surrogate key
• Result of an aggregate function
Nov. 15, 2016 Seminar of DISA Lab 43
Similarity Group By: Distance Spaces
• Aggregate functions:
• COUNT
• MIN
• The object closest to the group representative
• MAX
• Furthest object from group representative
• AVG
• Artificial object (center)?
• MEDOID
• SUM
• A concatenation of all objects?
• SET
• A bag of all original objects
• COVER
• a set of boundary objects
• “skyline” objects (or “convex hull”)
Nov. 15, 2016 Seminar of DISA Lab 44
Future Work
• Combination of multiple predicates
• Optimized algorithms
• Similarity group by variants defined
• Introduce parameter refinement – “drill-down” feature
• Incorporate diversity in result
• Need for new aggregate functions
• Efficient algorithms for metric spaces
• Many existing rely on sorting the attribute values
• New query types
• “Multi-query” kNN
• Recursive self-join
Nov. 15, 2016 Seminar of DISA Lab 45