Enhancing Similarity Search
Throughput by Dynamic Query
Reordering
Filip Nálepa, Michal Batko, Pavel Zezula
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Big Data Processing
▪ Large amount of data produced every second
▪ Need to process the data
▪ Two basic approaches:
▪ Store and process later, i.e., database processing
▪ Process continuously, i.e., stream processing
▪ Examples of stream processing applications:
▪ Surveillance camera stream and event detection
▪ Mail stream and spam filter
▪ Publish/subscribe applications
2
Stream Processing Scenarios
▪ Stream: potentially infinite sequence of data items (d1, d2, …)
▪ Basic scenarios:
▪ Data items processed immediately, possible data item skipping
→ minimize delay
▪ E.g., event detection in surveillance camera stream
▪ Process everything as fast as possible, data item can be delayed
→ maximize throughput
▪ That‘s our focus
▪ Motivating examples in similarity search
▪ Image annotation – annotate a stream of images collected by a
web crawler
▪ Publish/subscribe applications – categorize a stream of documents
▪ → stream of query objects
3
Problem Definition
▪ Domain of objects D
▪ DB of objects D indexed in the metric space
▪ Distance function d: D x D → R determines the similarity of two
objects
▪ Stream of query objects ((q1, t1), (q2, t2), …)
▪ qi ∈ D
▪ ti – time of arrival, ti ≤ ti+1
▪ Evaluate k-NN query for each qi, i.e., find k most similar
objects in DB to qi
▪ Optimization criteria – throughput
▪ Maximize the number of processed query objects
4
Architecture
▪ Typical similarity search techniques:
▪ Partitioned data of DB stored on a disk
▪ Read a subset of partitions during query evaluation → bottleneck
▪ Idea: similar query objects need similar sets of partitions → save disk accesses
▪ Buffer: waiting query objects, query object reordering
▪ Metric index: query evaluation
▪ Cache: in-memory caching of data partitions
5
Disk
Buffer
Query Cache
Query
Stream
Result
Metric index
Cache
▪ Generic metric index
▪ Data partitioning P = {p1, …, pn} where pi ⊆ D
▪ I(q) ⊆ P; partitions accessed during evaluation of q
▪ Partitions caching
▪ cache = {p1, …, pm} ⊆ P
▪ Cache utility cu =
|𝐼 𝑞 ∩ 𝑐𝑎𝑐ℎ𝑒|
|𝐼 𝑞 |
▪ Time to process a given query: queryTime(cu)
▪ Assumption: cu1 ≤ cu2 → queryTime(cu1) ≥ queryTime(cu2)
6
0%
20%
40%
60%
80%
100%
0% 20% 40% 60% 80% 100%
%ofquerytime
comparedtonocache
Cache utility (in %)
Buffer – Query Ordering
▪ Simplified buffer representation as an undirected complete graph G
▪ Vertices = query objects in the buffer
▪ Value of edge |pq| = time to process q after p (depends on the cache utility)
▪ Query ordering = path in G
▪ Throughput maximization: shortest path in G
▪ How to find a short path?
▪ How to construct the graph?
7
4
4
5
3
9
8
q1
q2
q3
q4
How to Find a Short Path?
▪ Shortest path search – NP-hard problem (travelling salesman)
▪ Added difficulty: new vertices added dynamically as new query
objects arrive to the buffer
▪ Heuristics: find a dense cluster and evaluate queries in the cluster
8
How to Construct the Graph and the Clusters?
▪ Requirements: efficient, support for graph evolution
▪ Approach: estimate the edge values (query times) by metric distances
▪ Low metric distance → high cache utility → low query time
▪ Computing all metric distances: time consuming
▪ → Pivot-based clustering
▪ Fixed set of pivots p1, …, pn in the metric space
▪ Compute metric distance of a new query object to all the pivots
▪ Order the pivots from the nearest to the farthest one → pivot permutation
= cluster
9
p1
p2
p3
(p2, p3, p1)
(p1, p3, p2)
(p2, p1, p3)
Experiments – Fixed Buffer Size
▪ DB: 10 mil. images represented by MPEG-7 descriptors
▪ Stream of query objects: evaluation of approximate 10-NN queries
▪ Cache size: 90,000 objects (0.9% of the DB)
▪ Fixed buffer size: 1 query object added per 1 processed query
▪ Baseline: no reordering, no caching
10
0%
10%
20%
30%
40%
50%
0 10000 20000 30000 40000
Cacheutility
Buffer size
Cache utility
0
1
2
3
0 10000 20000 30000 40000
Throughputspedup
comparedtonooptimization
Buffer size
Speedup
Experiments – Fixed Input Rate
▪ DB: 10 mil. images represented by MPEG-7 descriptors
▪ Stream of query objects: evaluation of approximate 10-NN queries
(10 nearest neighbors)
▪ Cache size: 90,000 objects (0.9% of the DB)
▪ Fixed input rate: new query object arrives every x time units
▪ Average query time for no reordering and no caching: 113 ms
11
0
20
40
60
80
100
120
0 5000 10000
Buffersize(thousands)
Elapsed time [s]
20 ms 30 ms 60 ms
20 ms 30 ms 60 ms
Max
delay [s]
4031 2988 1565
Median
delay [s]
1525 894 234
Cache
utility
0.78 0.59 0.30
Delay: time since query object
arrived until it was processed
Summary
▪ Stream of similarity query objects
▪ Enhancing the throughput by query reordering and data partition caching
12
Disk
Buffer
Query Cache
Query
Stream
Result
Metric index