Similarity Management of Data the DISA Experience Pavel Zezula DISA FI Masaryk University Brno, Czech Republic Outline of the talk •Similarity in our lives •Principles of metric similarity searching –effectiveness – metric similarity features –Efficiency - •Similarity search in applications: –Searching in images of human faces –Searching for image annotation –Processing streams of similarity queries –Searching in motion capture data – Castiglione della Pescaia SEBD 2019, June 16-19 Similarity in the Real-Life quotations from the social psychology literature •Any event in the history of organism is, in a sense, unique. • •Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity. • •Similarity (proximity, resemblance, communality, representativeness, psychological distance, ...) is fundamental to theories of perception, learning, judgment, etc. • •Similarity is subjective and context-dependent Castiglione della Pescaia SEBD 2019, June 16-19 Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output009.png?1329961814 Castiglione della Pescaia SEBD 2019, June 16-19 Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output001.png?1329961814 Castiglione della Pescaia SEBD 2019, June 16-19 Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output004.png?1329961814 Castiglione della Pescaia SEBD 2019, June 16-19 Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output010.png?1329961814 Castiglione della Pescaia SEBD 2019, June 16-19 Prototypicality or Centrality not symmetric Castiglione della Pescaia SEBD 2019, June 16-19 Context/Data/Envinronment Dependent circumstances alter similarities Castiglione della Pescaia SEBD 2019, June 16-19 We learned from School •GEOMETRY: •Two polygons are similar to each other, if: 1)Their corresponding angles are congruent •∠A = ∠E; ∠B = ∠F; ∠C = ∠G; ∠D = ∠H, and 2)The lengths of their corresponding sides are proportional •AB/EF = BC/FG = CD/GH = DA/HE • B C A D E F H G SEBD 2019, June 16-19 Castiglione della Pescaia Similarity & Geometry •If one polygon is similar to a second polygon, and the second polygon is similar to the third polygon, the first polygon is also similar to the third polygon. •In any case: • •Two geometric figures are either similar or they are not similar at all • Castiglione della Pescaia SEBD 2019, June 16-19 Contemporary Networked Media The digital data view •Almost everything that we see, read, hear, write, measure, or observe can be digital. •Users autonomously contribute to production of global media and the growth is exponential. •Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events. •The elements of networked media are related by numerous multi-facet links of similarity. • • •Majority of current data is unstructured •possibly only structured on display Castiglione della Pescaia SEBD 2019, June 16-19 Challenge •Networked media database is getting close to the human “fact-bases” –the gap between physical and digital world has blurred • •Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections. • •WHY? •It is the similarity which is in the world revealing. Castiglione della Pescaia SEBD 2019, June 16-19 Iterative and Interactive Nature of Contemporary Searching • •When we search, our next actions are reactions to the stimuli of previous search results • •What we find is changing what we seek • •In any case, search must be: • • fast, simple, and relevant Castiglione della Pescaia SEBD 2019, June 16-19 Metric Space: A Geometric Model of Similarity •Metric space: M = (D,d) –D – domain –distance function d(x,y) •"x,y,z Î D •d(x,y) > 0 - non-negativity •d(x,y) = 0 Û x = y - identity •d(x,y) = d(y,x) - symmetry •d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality Castiglione della Pescaia SEBD 2019, June 16-19 Examples of Distance Functions •Lp Minkovski distance (for vectors) •L1 – city-block distance » •L2 – Euclidean distance • •L¥ – infinity » •Edit distance (for strings) •minimal number of insertions, deletions and substitutions •d(‘application’, ‘applet’) = 6 » •Jaccard’s coefficient (for sets A,B) • Castiglione della Pescaia SEBD 2019, June 16-19 Examples of Distance Functions •Mahalanobis distance –for vectors with correlated dimensions •Hausdorff distance –for sets with elements related by another distance •Earth movers distance –primarily for histograms (sets of weighted features) •and many others – Castiglione della Pescaia SEBD 2019, June 16-19 Content-Based Search •Content-based search in images 1806 1040158 1045791 984761 1042473 Image base SEBD 2019, June 16-19 Castiglione della Pescaia Extracting Features •Extracting features 1806 Image level R B G Feature level SEBD 2019, June 16-19 Castiglione della Pescaia MPEG-7 •Multimedia Content Descriptors Standard ~ 2000 •Global feature descriptors: –Color, shape, texture, … – – – – –One high-dimensional vector per image and feature –Minkovski distance used • – Castiglione della Pescaia SEBD 2019, June 16-19 SEBD 2019, June 16-19 Visual Similarity - Local feature descriptors – SIFT, SURF, etc. - Invariant to image scaling, small viewpoint change, rotation, noise, illumination IMG_2088_sifts_thick Castiglione della Pescaia SEBD 2019, June 16-19 Visual Similarity - finding coresspondence Castiglione della Pescaia Biometrics: Fingerprint •Minutiae detection: –Detect ridges (endings and branching) –Represented as a sequence of minutiae •P=( (r1,e1,θ1), …, (rm,em,θm) ) •Point in polar coordinates (r,e) and direction θ •Matching of two sequences: –Align input sequence with database one –Compute weighted edit distance •wins,del=620 •wrepl=[0;26] - depending on similarity of two minutiae Castiglione della Pescaia SEBD 2019, June 16-19 fingerprint1.png fingerprint1.png Points in the sequence are ordered in an increasing order of radial angle (e). Biometrics: Hand Recognition •Hand image analysis –Contour extraction, global registration •Rotation, translation, normalization –Finger registration –Contour represented as a set of pixels F={f1,…,fNF} •Matching: modified Hausdorff distance • Castiglione della Pescaia SEBD 2019, June 16-19 hand1.png hand2.png hand3.png || f-g || = Euclidean distance between pixel positions of f and g Multiple Visual Aspects • Castiglione della Pescaia SEBD 2019, June 16-19 tovarna Current Approaches to Feature Extraction •Neural networks technology –Convolutional Neural Networks (CNN) –Recurrent Neural Networks (RNN) SEBD 2019, June 16-19 Castiglione della Pescaia Classified dataset Training data Validation data Výsledek obrázku pro neural network Training (Fine-tuning) Validation Neural network model Data split Deep Learning for Feature Extraction •Deep learning •If large amounts of data are provided, the system begins to understand them and respond in useful ways •Several architectures: –Convolutional neural networks (CNN) – – – –Recurrent neural networks (RNN) SEBD 2019, June 16-19 Castiglione della Pescaia Výsledek obrázku pro neural network cat image classification Výsledek obrázku pro recurrent neural network Convolutional Neural Networks •Convolutional neural networks (CNN) •Consist of a hierarchy of layers •Each layer transforms the data into more abstract representations (e.g., edge -> nose -> face) •The output layer combines the features to make predictions SEBD 2019, June 16-19 Castiglione della Pescaia Convolutional Neural Networks •Convolutional Neural Network (CNN) – AlexNet •The last layer with 1,000 output categories •Output of any layer can be used as a feature Výsledek obrázku pro convolutional neural network SEBD 2019, June 16-19 Castiglione della Pescaia -It contains 5 convolutional layers and 3 fully connected layers. Relu is applied after very convolutional and fully connected layer. Dropout is applied before the first and the second fully connected year. -Convolutional layer is a feature detector that automatically learns to filter out not needed information from an input by using convolutional kernel. -Pooling layers compute the max or average value of a particular feature over a region of the input data (downsizing of input images) and also help to detect objects in some unusual places and reduce memory size. -The network has 62.3 million parameters (and 650,000 neurons), and needs 1.1 billion computation units in a forward pass. We can also see convolution layers, which accounts for 6% of all the parameters, consumes 95% of the computation. It takes five to six days to train on two GTX 580 3GB GPUs. -The network uses Relu instead of Tanh to add non-linearity. It accelerates the speed by 6 times at the same accuracy. Recurrent Neural Networks •Recurrent neural networks (RNN) •RNN cells remember the inputs in internal memory, which is very suitable for sequential data •The output vector’s contents are influenced by the entire history of inputs – – SouvisejÃcà obrázek SEBD 2019, June 16-19 Castiglione della Pescaia “Whenever there is a sequence of data and that temporal dynamics that connects the data is more important than the spatial content of each individual frame.” – Lex Fridman (MIT) Recurrent Neural Networks •Recurrent neural networks (RNN) •Long-Short Term Memory (LSTM) networks: –Learn when data should be remembered and when they should be thrown away –Well-suited to learn from experience to classify, process and predict time series when there are very long time lags of unknown size between important events – – SouvisejÃcà obrázek SEBD 2019, June 16-19 Castiglione della Pescaia -Provide operations for reading, writing and resetting Deep Learning Summary • • •It is no magic! Just statistics in a black box, but exceptional effective at learning patterns, provided powerful computational infrastructure can be applied - GPU cards • • •Can be used not only for classification but is also able to provide content preserving feature vectors. •When calibrated by an Lp distance good quality similarity estimates can be obtained. SEBD 2019, June 16-19 Castiglione della Pescaia Machines that learn to represent the world from experience Similarity Search Problem •For X ÍD in metric space M, •pre-process X so that the similarity queries •are executed efficiently. • •Implementation problems: -How to partition the data to reduce search space -How to ask questions - definition of queries –- How to execute queries – to achieve performance •The challenge: –In metric space, no total ordering exists! • Castiglione della Pescaia SEBD 2019, June 16-19 Castiglione della Pescaia Basic Partitioning Principles •Given a set X Í D in M=(D,d), basic partitioning principles have been defined: – –Ball partitioning –Generalized hyper-plane partitioning • • •Note: –Some special cases, such as Euclidian or Supermetric spaces are more tractable – SEBD 2019, June 16-19 Castiglione della Pescaia Ball Partitioning •Inner set: { x Î X | d(p,x) ≤ dm } •Outer set: { x Î X | d(p,x) > dm } • p dm SEBD 2019, June 16-19 Castiglione della Pescaia Generalized Hyper-plane •{ x Î X | d(p1,x) ≤ d(p2,x) } •{ x Î X | d(p1,x) > d(p2,x) } • p2 p1 SEBD 2019, June 16-19 Castiglione della Pescaia Similarity Range Query • • • •range query –R(q,r) = { x Î X | d(q,x) ≤ r } • •… all museums up to 2km from my hotel … r q SEBD 2019, June 16-19 Castiglione della Pescaia Nearest Neighbor Query •the nearest neighbor query –NN(q) = x –x Î X, "y Î X, d(q,x) ≤ d(q,y) • •k-nearest neighbor query –k-NN(q,k) = A –A Í X, |A| = k –"x Î A, y Î X – A, d(q,x) ≤ d(q,y) • • • •… five closest museums to my hotel … q k=5 SEBD 2019, June 16-19 Castiglione della Pescaia SEBD 2019, June 16-19 Reverse Nearest Neighbor • • •… all hotels with a specific museum as a nearest cultural heritage cite … Castiglione della Pescaia SEBD 2019, June 16-19 Example of 2-RNN •Objects o4, o5, and o6 have q between their two nearest neighbor. o5 q o4 o6 o1 o2 o3 Castiglione della Pescaia SEBD 2019, June 16-19 Similarity Joins •similarity join of two data sets – – – • •similarity self join ó X = Y • •…pairs of hotels and museums •which are five minutes walk apart … m Castiglione della Pescaia SEBD 2019, June 16-19 Pivot Filtering •Idea: •Given R(q,r) use triangle inequality for pruning •All distances between objects and a pivot p are known •Prune object o Î X if any holds –d(p,o) < d(p,q) – r –d(p,o) > d(p,q) + r q r p Generalized concept of the object-pivot constraint. Castiglione della Pescaia SEBD 2019, June 16-19 Pivot Filtering (cont.) •Filtering with two pivots –Only Objects in the dark blue – region have to be checked. – –Effectiveness is improved – using more pivots. • p1 p2 q r o1 o2 o3 Castiglione della Pescaia SEBD 2019, June 16-19 Pivot Filtering (summary) •Given a metric space M=(D,d) and a set of pivots • P = { p1, p2, p3,…, pn }. We define a mapping function Y: (D,d) ® (Rn,L∞) as follows: • • Y(o) = (d(o,p1), …, d(o,pn)) • • Then, we can bound the distance d(q,o) from below: • • L∞(Y(o), Y(q)) ≤ d(q,o) The M-tree [Ciaccia, Patella, Zezula, VLDB 1997] • •1) Paged organization •2) Dynamic •3) Suitable for arbitrary metric spaces •4) I/O and CPU optimization –- computing d can be time-consuming Castiglione della Pescaia SEBD 2019, June 16-19 The M-tree Idea •Depending on the metric, the “shape” of index regions changes C D E F A B B F D E A C Metric: L2 (Euclidean) L1 (city-block) L¥ (max-metric) weighted-Euclidean quadratic form Castiglione della Pescaia SEBD 2019, June 16-19 Castiglione della Pescaia SEBD 2019, June 16-19 o7 M-tree: Example o1 o6 o10 o3 o2 o5 o4 o9 o8 o11 o1 4.5 -.- o2 6.9 -.- o1 1.4 0.0 o10 1.2 3.3 o7 1.3 3.8 o2 2.9 0.0 o4 1.6 5.3 o2 0.0 o8 2.9 o1 0.0 o6 1.4 o10 0.0 o3 1.2 o7 0.0 o5 1.3 o11 1.0 o4 0.0 o9 1.6 Covering radius Distance to parent Distance to parent Distance to parent Distance to parent Leaf entries Similarity Search: Part II, Chapter 3 48 M-tree: Range Search •Given R(q,r): •Traverse the tree in a depth-first manner •In an internal node, for each entry áp,rc,d(p,pp),ptrñ –Prune the subtree if |d(q,pp) – d(p,pp)| – rc > r –Application of the pivot-pivot constraint q q r p rc pp r p rc pp Similarity Search: Part II, Chapter 3 49 M-tree: Range Search (cont.) •If not discarded, compute d(q,p) and –Prune the subtree if d(q,p) – rc > r –Application of the range-pivot constraint – – – – – •All non-pruned entries are searched recursively. q p rc r D-Index [Dohnal, Gennaro, Zezula, MTA 2002] 4 separable buckets at the first level 2 separable buckets at the second level exclusion bucket of the whole structure Castiglione della Pescaia SEBD 2019, June 16-19 D-index: Insertion Castiglione della Pescaia SEBD 2019, June 16-19 D-index: Range Search q r q r q r q r q r q r Castiglione della Pescaia SEBD 2019, June 16-19 Pivot Permutation Approach •Assume a set n of pivots {p1, p2, ... pn} • •Given object x in X, order the pivots according to d(x, pi) • •Let ∏x be a permutation on the set of pivot indexes {1, … , n} such that ∏x(j) is index of the j-th closest pivot from x –e.g., ∏x(1) is the index of the closest pivot to x –p∏x(j) is the j-th closets pivot from x – •∏x is denoted as Pivot Permutation (PP) with respect to x. Castiglione della Pescaia SEBD 2019, June 16-19 Pivot Permutation Approach Castiglione della Pescaia •Can be seen as a recursive Voronoi partitioning to level l • •Cell C(i1, … il) contains objects x for which: • •∏x(1) = i1, ∏x (2) = i2, … , ∏x (l)=il SEBD 2019, June 16-19 Metric Sketches for Fast Searching and Filtering •Transformation of any metric space (D,d) to the Hamming space –Each descriptor o is transformed into a bit-string sketch of length λ. Hamming distance evaluates the number of different bits – very efficient, hardware instruction •Sketches compared by the Hamming distance should approximate similarity relationships of the original metric •Typical sketch length is 32 – 320 bits Castiglione della Pescaia SEBD 2019, June 16-19 Sketching Transformation •GHP to set one bit of sk(o) • •GHP to set bits μ and ν of sk(o) • Castiglione della Pescaia SEBD 2019, June 16-19 Properties and Application • •Good sketches (choosing pivots): –Balanced split –Low correlation •Application – even better for filtering than searching Castiglione della Pescaia SEBD 2019, June 16-19 Textbooks on Metric Searching technology book Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics) Hanan Samet Foundation of Multidimensional and Metric Data Structures Morgan Kaufmann, 2006 P. Zezula, G. Amato, V. Dohnal, and M. Batko Similarity Search: The Metric Space Approach Springer, 2005 Teaching material: http://www.nmis.isti.cnr.it/amato/similarity-search-book/ Castiglione della Pescaia SEBD 2019, June 16-19 [USEMAP] Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework Metric space (D,d) Operations Storage Centralized index structures Distributed index structures Communication Net Vectors • Lp and quadratic form Strings • (weighted) edit and protein sequence Insert, delete, range query, k-NN query, Incremental k-NN Volatile memory Persistent memory Performance statistics Castiglione della Pescaia SEBD 2019, June 16-19 Similarity Search Demos •20M images: http://disa.fi.muni.cz/demos/profiset-decaf/ •Fashion: http://disa.fi.muni.cz/twenga/ •Image annotation: http://disa.fi.muni.cz/annotation/ • •Fingerprints: http://disa.fi.muni.cz/fingerprints/ •Time series: http://disa.fi.muni.cz/subseq/ •Multi-modal person ident.: http://disa.fi.muni.cz/mmpi Castiglione della Pescaia SEBD 2019, June 16-19 http://disa.fi.muni.cz/FaceMatch/image/215920 http://disa.fi.muni.cz/FaceMatch/image/215920 Preprocessing Retrieval http://disa.fi.muni.cz/FaceMatch/image/215920 Face detection with several technologies Merge of detected faces <13.9, 9.5, -6.0, 712.1, …> <17.9, 12.1, -9.1, 692.0, …> <8.8, 7.7, -3.5, 570.8, …> <14.4, 8.2, -8.4, 704.0, …> <10.1, 5.8, 40.6, 99.6, …> <5.4, 1.2, -60.4, 88.0, …> <45.1, 64.8, 90.6, 78.6, …> Face description with several technologies Similarity Search in Collections of Faces http://upload.wikimedia.org/wikipedia/commons/8/88/Cameron_Diaz_WE_2012_Shankbone_3.JPG <13.9, 9.5, -6.0, 712.1, …> <10.6, 78.9, -45.6, 101.3, …> 0.12 0.17 0.18 0.23 Fused features DB Features indexing by one technique > Face Detection Features Extraction Candidates filtering Index Fused features saving Candidate faces Query image Castiglione della Pescaia SEBD 2019, June 16-19 Fused Face Detection and Face Matching •Fused face detection: •Faces detected by more technologies are taken into account •Showcase: 3 technologies, compliance of at least two: • •Fused face matching: •Characteristic features from more technologies are available for each face •Similarity of two faces evaluated by each technology is normalized into interval [0, 1] •Normalized value expresses a probability that faces belong to the same person •Highest probability is used to determine the similarity of faces Software name OpenCV Luxand Verilook Compliance of at least 2 Recall / precision (%) 55 / 89 64 / 83 73 / 83 64 / 96 Castiglione della Pescaia SEBD 2019, June 16-19 Face Matching Results, Relevance Feedback •User may improve results by marking correctly found faces in several iterations: 00788_941205_qr00.jpg 00816_960530_rb00.jpg 00825_940307_fa00.jpg 00825_940307_fb00.jpg 00825_940307_hl00.jpg 00825_940307_pl00.jpg 00846_940307_hr_a00.jpg 00717_960530_fb00.jpg query 00788_941205_qr00.jpg 00816_960530_rb00.jpg 00788_941205_qr00.jpg 00825_940307_fa00.jpg 00825_940307_fb00.jpg 00825_940307_hl00.jpg 00825_940307_hr00.jpg 00825_940307_pl00.jpg 00846_940307_hr_a00.jpg 00717_960530_fb00.jpg query 00825_940307_hl00.jpg 00825_940307_pl00.jpg 00825_940307_hl00.jpg 00825_940307_hr00.jpg 00825_940307_hl00.jpg 00846_940307_hr_a00.jpg 2nd iteration 1st iteration Castiglione della Pescaia SEBD 2019, June 16-19 Slide ‹#› Search-based Image Annotation § §We already have a strong tool – the similarity search §For any input image, we can retrieve visually similar images §Metadata of the similar images can be used to describe the original image § https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRSpxVqEaSo5HgnpQLjkP44gtENLw8eHm8jhsO6LnM98BS eIzNt §Keyword-based image retrieval §Popular and intuitive §Needs pictures with text metadata §Manual annotation is expensive ? Need for automatic image annotation Search-based image annotation Slide ‹#› Search-based annotation principles https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRSpxVqEaSo5HgnpQLjkP44gtENLw8eHm8jhsO6LnM98BS eIzNt Annotated image collection Content-based image retrieval Similar annotated images Yellow, bloom, pretty Meadow, outdoors, dandelion Mary’s garden, summer Candidate keyword processing Semantic resources Final candidate keywords with probabilities Plant 0.3 Flower 0.3 Garden 0.15 Sun 0.05 Human 0.1 Park 0.1 d = 0.2 d = 0.6 d = 0.5 ? Slide ‹#› Content-based retrieval for annotations §What we need: § §Large collection of reliably annotated images: Profiset §20 million general-purpose photos from the Profimedia photostock company §Descriptive keywords for each photo provided by authors who want to sell the pictures → rich and reliable annotations §Efficient and effective search: DeCAF descriptors and PPP-codes §DeCAF: 4096-dimensional vector obtained from the last layer of a neural network image classifier §PPP-codes: effective permutation-based metric space indexing method § § § § § § § § Profiset keywords: botany, close, closeup, color, daytime, detail, exterior, flower, germany, hepatica, horticulture, laughingstock, liverwort, lobed, mecklenburg, nature, nobilis, outdoor, outside, plant, pomerania, purple, round, western http://disa.fi.muni.cz/profimedia/imagesJpeg/0009008905 Slide ‹#› ConceptRank §Candidate keyword analysis inspired by Google PageRank §Uses semantic connections between candidate keywords to determine the probability of individual candidates §Main steps: §Construct a graph of candidate keywords related by WordNet semantic links §New candidates can be found during the WordNet exploration §Apply biased random walk with restarts to compute the score of each keyword §Keyword scores from the content-based search are included via the biased restart Similar annotated images Yellow, bloom, pretty Meadow, outdoors, dandelion Mary’s garden, summer d = 0.2 d = 0.6 d = 0.5 Slide ‹#› Example http://disa.fi.muni.cz/profimedia/imagesJpeg/0077128591 Candidate keywords after CBIR church, architecture, travel, europe, building, religion, germany, buildings, north, churches, christianity, america, religious, exterior, st, historic, world, tourism, united, usa, … 1.Retrieve 100 similar images from Profiset 2.Merge their keywords, compute frequencies 3.Build the semantic network using WordNet 4.Compute the ConceptRank 5.Apply postprocessing & return 20 most probable keywords ConceptRank scores building (2.53), structure (2.41), LANDSCAPE (2.10), BUILDINGS (1.87), OBJECT (1.84), NATURE (1.78), place_of_worship (1.75), church (1.74), Europe (1.68), religion (1.64), continent (1.51), … Final keywords building, structure, church, religion, continent, group, travel, island, sky, architecture, tower, person, belief, locations, chapel, christianity, tourism, regions, country, district Semantic network 4 relationships: hypernym (dog → animal), hyponym (animal → dog), meronym (leaf → tree), holonym (tree → leaf) 270 network nodes, 471 edges Slide ‹#› Annotations in use §Participation in the ImageCLEF 2014 Scalable Annotation Challenge §2nd place, mean average precision of annotation approx. 60 % § §Web demo & Mozilla addon §http://disa.fi.muni.cz/prototype-applications/image-annotation § § § § § § § § § § § Similarity Search in Streams ▪ ▪Two basic approaches to explore data: ▪Store, pre-process and search later, database processing ▪Process (filter) continuously, stream processing ▪ ▪Examples of stream processing applications: ▪Surveillance camera and event detection ▪Mail stream and spam filter ▪Publish/subscribe applications Castiglione della Pescaia SEBD 2019, June 16-19 Stream Processing Scenarios ▪Stream: potentially infinite sequence of data items (d1, d2, …) – tuples, images, frames, etc. ▪Basic scenarios: ▪Data items processed immediately, possible data item skipping → minimize delay - e.g., event detection ▪Process everything as fast as possible, delay possible to maximize throughput - our focus ▪Motivating examples with similarity searching ▪Image annotation – annotate a stream of images collected by a web crawler ▪Publish/subscribe applications – categorize a stream of documents by similarity searching Castiglione della Pescaia SEBD 2019, June 16-19 Processing Streams of Query Objects ▪Typical large-scale similarity search approach: ▪partitioned data stored on a disk ▪partition reads from a disk form the bottleneck ▪Idea: similar queries need similar sets of partitions → save accesses ▪Buffer: memory used for reordering (clustering) queries ▪Cache: memory containing previously read data partitions Disk Buffer Query Cache Query Stream Result Metric index Castiglione della Pescaia SEBD 2019, June 16-19 Experiment Results ▪100,000 processed 10-NN queries ▪DB: 1 mil. MPEG-7 descriptors ▪Buffer capacity: 8,000 queries ▪Cache size: 40,000 objects (4% of the DB) ▪100,000 processed 10-NN queries ▪DB: 10 mil. MPEG-7 descriptors ▪Buffer capacity: 10,000 queries ▪Cache size: 90,000 objects (0.9% of the DB) Castiglione della Pescaia SEBD 2019, June 16-19 Continuous kNN Join ▪Real-time recommendation ▪Set of user profiles ▪Stream of published data ▪Recommend published data to the users 74 Stream Time window Past Future User profiles Recommended items Content-Based Recommendation ▪Metric space M = (D, d) ▪Domain of objects D (often high-dimensional vectors) ▪Distance function d: D x D → R determines the similarity of two objects (e.g., Euclidean distance) ▪Content-based recommendation ▪Represent stream objects and users‘ preferences in the same domain D ▪Recommend each user with their k nearest stream objects, i.e., kNN join 75 Users‘ preferences (query objects) Recommended stream objects Continuous kNN Join 76 Time window Past Future Users‘ preferences (query objects) 3NN join ▪Task: efficiently evaluate kNN join at any time ▪Challenge: efficient join computation having a lot of query objects and a lot of stream objects in the time window ▪General approach: update the join result continuously ▪Problem: a lot of time-consuming metric distance computations ▪Approximation technique proposal ▪Sketch-based filter ▪Hamming distances Continuous kNN Join 77 Time window Past Future Users‘ preferences (query objects) 3NN join ▪Task: efficiently evaluate kNN join at any time ▪Challenge: efficient join computation having a lot of query objects and a lot of stream objects in the time window ▪General approach: update the join result continuously ▪Problem: a lot of time-consuming metric distance computations ▪Approximation technique proposal ▪Sketch-based filter ▪Hamming distances ? Continuous kNN Join 78 Time window Past Future Users‘ preferences (query objects) 3NN join ▪Task: efficiently evaluate kNN join at any time ▪Challenge: efficient join computation having a lot of query objects and a lot of stream objects in the time window ▪General approach: update the join result continuously ▪Problem: a lot of time-consuming metric distance computations ▪Approximation technique proposal ▪Sketch-based filter ▪Hamming distances Experiments ▪Dataset: Caffe descriptors (4,096 dimensional vectors) ▪1 step: process a new stream object, delete the oldest stream object ▪Each experiment: 1,000 steps ▪10-NN join ▪200,000 stream objects in the time window ▪Sketch-based approach vs. no sketches: ▪sketches nearly 13 times faster, almost precise results (less than 0.5% missed objects) 79 Continuous Time-Dependent KNN Join ▪Similarity decreases over time ▪Figure: stream objects shrink with increasing age ▪Proposed algorithms for continuous update of the join 80 Stream objects Users‘ preferences (query objects) Recommended stream objects ordered by descending similarity Time Time1 Time2 Similarity Search in Motion Capture (Mocap) Data Digital representations of human motions, recorded by motion capturing devices for further use in a variety of applications. Castiglione della Pescaia SEBD 2019, JUNE 16-19 What Is Mocap •Digital representation is depicted by series of coordinates of body joints in space-time. •Complex multi-dimensional spatio-temporal data (3D space, 31 joints, 120 frames per second). •Visualized by simplified human skeleton (stick figure), coordinates of joints stored as float numbers. •1 minute of such motion data ≈ 669,600 float numbers. http://cg.cs.uni-bonn.de/aigaion2root/attachments/keyframeSynthesis_web.jpg Castiglione della Pescaia SEBD 2019, JUNE 16-19 The Need for a Similarity Measure ◦Almost every application of Mocap data (analysis, searching, action recognition, detection, synthesis, clustering) requires a pair-wise action comparison based on similarity. ◦The challenge: •Develop a measure for content-based similarity ◦ comparison of Mocap data. ? Castiglione della Pescaia SEBD 2019, JUNE 16-19 Motion Similarity Problems The same action can be performed differently •by different actors, •in various styles, •in various speed, •or start at different body configurations. ◦ ◦Similarity of motions is application-dependent •e.g., general action recognition vs. ◦ person-identification •there is no universal similarity model Castiglione della Pescaia SEBD 2019, JUNE 16-19 Comparing Similarity in Motions General Overview 1. DATA REPRESENTATION absolute coordinates, relative distances, joint rotation angles or velocities (quantization or dimensionality reduction might be applied) Machine Learning • Convolutional Neural Networks, Boltzmann M. • Support Vector Machines Special structures • Motion and Action graphs • Temporal pyramids • Hidden Markov models Distance-Based functions • Dynamic Time Warping • k-NN + L2 2. WAYS OF COMPARISON + Castiglione della Pescaia SEBD 2019, JUNE 16-19 https://docs.google.com/document/d/130MFzYC-hVFOCZhe58H7kcKkYuDPJp5p-D9gvWE8eU0/edit DTW is not metric – but approximations exists for indexing Comparing Similarity in Motions Examples Joint positions features + Euclidean Distance (Krüger 2010) Fisher Vector + SVM classifier (Evangelidis 2014) Time Series + Dynamic Time Warping (Müller 2009) Castiglione della Pescaia SEBD 2019, JUNE 16-19 https://docs.google.com/document/d/130MFzYC-hVFOCZhe58H7kcKkYuDPJp5p-D9gvWE8eU0/edit Our Approach – Motion Images Every single-frame joint configuration is normalized (by centering and rotating), then transformed into a RGB stripe image while fully preserving skeleton configuration. Castiglione della Pescaia SEBD 2019, JUNE 16-19 We focus on body joints, not volumetric model nor skeleton bones Why normalziation is important Motion Images + Caffe 1) Effective transformation from (dynamic) motion capture data into (static) images. 2) Extract fixed-size feature vector using content-based image descriptors. 3) Index for fast and scalable search 1) Motion Motion Image (hop-one-leg) 2) <0, 0, 4.56, 0, 7.88, …> Nearest neighbors 3) d = 13.17 d = 10.1 d = 14.2 d = 1.1 4096-dim Caffe vector Castiglione della Pescaia SEBD 2019, JUNE 16-19 Similarity model: Caffe + L2 Caffe Descriptors Extracted using Convolutional Neural Network trained on 1.3M photographs Network can be fine-tuned to the domain of motion images Output of 7th layer is a 4096-dimensional vector transform Motion Images extract 4096-dimensional features Caffe descriptors are fixed sized vectors extracted from motion images. They are compared for similarity by L2 distance function. Castiglione della Pescaia SEBD 2019, JUNE 16-19 Motion Images – Properties •Pattern recognition is a mature concept nowadays many highly accurate computer vision techniques might be employed. •The proposed similarity measure is robust and tolerant towards inferior data quality, execution speed and imprecise segmentation. • •Fixed-size feature vectors can be indexed in large scale ◦ evaluate a query in one year long Mocap data in less than a second • •Fixed-size feature vectors compress the original data Castiglione della Pescaia SEBD 2019, JUNE 16-19 Sizes Compared • 5 seconds of mocap •3 x 31 x 120 x 5 = 57 600 floats •≈ 460 KB • 1 image 256 x 256 px in png format •≈ 5-10 KB •1 caffe descriptor •4096 floats ≈ 32 KB •4096 bits ≈ 1 KB • 1 mpeg7 descriptor •256 floats ≈ 2 KB • • Castiglione della Pescaia SEBD 2019, JUNE 16-19 SISAP – Similarity Search and Applications International conference series (http://sisap.org/) 2008 Cancun Mexico Výsledek obrázku pro cancun 2012 Toronto Canada 2016 Tokyo Japan 2018 Lima Peru 2009 Prague Czechia 2010 Instanbul Turkey 2011 Lipari Italy 2013 A Coruña Spain 2017 Munich Germany 2015 Glasgow UK 2014 Los Cabos Mexico Výsledek obrázku pro prague Výsledek obrázku pro istanbul Výsledek obrázku pro lipari SouvisejÃcà obrázek Výsledek obrázku pro la coruna Výsledek obrázku pro los cabos Výsledek obrázku pro glasgow Výsledek obrázku pro tokyo fudzi Výsledek obrázku pro munich Výsledek obrázku pro lima peru SEBD 2019, June 16-19 Castiglione della Pescaia Laboratory of Data Intensive Systems and Applications disa.fi.muni.cz Castiglione della Pescaia SEBD 2019, JUNE 16-19