Similarity Searching for Database Applications Pavel Zezula Masaryk University Brno, Czech Republic Outline of the talk •On the importance of similarity and searching •Principles of metric similarity searching •Similarity search applications: –Searching in images of human faces –Searching for image annotation –Stream processing –Searching in motion capture data – Pragu, August 2016 2 20th East-European Conference on Advances in Databases and Information Systems Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output009.png?1329961814 Pragu, August 2016 3 20th East-European Conference on Advances in Databases and Information Systems Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output001.png?1329961814 Pragu, August 2016 4 20th East-European Conference on Advances in Databases and Information Systems Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output004.png?1329961814 Pragu, August 2016 5 20th East-European Conference on Advances in Databases and Information Systems Real-life Similarity •Are they similar? http://html.slidesharecdn.com/similarity-120222193728-phpapp02/output010.png?1329961814 Pragu, August 2016 6 20th East-European Conference on Advances in Databases and Information Systems Real-Life Motivation The social psychology view •Any event in the history of organism is, in a sense, unique. •Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity. •Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc. •Similarity is subjective a context-dependent Pragu, August 2016 7 20th East-European Conference on Advances in Databases and Information Systems Contemporary Networked Media The digital data view •Almost everything that we see, read, hear, write, measure, or observe can be digital. • •Users autonomously contribute to production of global media and the growth is exponential. • •Sites like Flickr, YouTube, Facebook host user contributed content for a variety of events. • •The elements of networked media are related by numerous multi-facet links of similarity. Pragu, August 2016 8 20th East-European Conference on Advances in Databases and Information Systems Challenge •Networked media database is getting close to the human “fact-bases” –the gap between physical and digital world has blurred • •Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections. • •WHY? •It is the similarity which is in the world revealing. Pragu, August 2016 9 20th East-European Conference on Advances in Databases and Information Systems Similarity and the Big Data •Loads on a sharp rise – usage on decline •The (3V) problem of: Volume, Variety, Velocity •Issues: –Acquisition: what to keep and what to discard –Unstructured data: what content to extract –Datafication: render into data many new aspects –Inaccuracy: approximation, imprecision, noise • Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 10 The Big Data problem •Shifts in thinking: –from some to all (scalability) –from clean to messy (approximate) •Technological obstacles: heterogeneity, scale, timeliness, complexity, and privacy aspects •Foundational challenges: scalable and secure data analysis, organization, retrieval, and modeling • Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 11 Search – the goals 1.We search to get results (papers, books, …) 2.We ask to find answers (what time … ) 3.We use filters so that the right staff finds us 4.We browse while wandering and way-finding in typically restricted space • •In reality, we move fluidly between modes of • ask, browse, filter, and search Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 12 Search – some quantitative facts •85% of all web traffic comes from search engines •450+ million searches/day are performed in North America alone •70%+ of all searches are done on Google sites • •Search is the most popular application •(second to E-mail??) Pragu, August 2016 13 20th East-European Conference on Advances in Databases and Information Systems Search – some experience • •60% of searchers NEVER go past 1st page of search results •The top three results draw 80% of the attention •The first few results inordinately influence query reformulation. • Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 14 Search - as an interaction • •When we search, our next actions are reactions to the stimuli of previous search results • •What we find is changing what we seek • •In any case, search must be: • • fast, simple, and relevant Pragu, August 2016 15 20th East-European Conference on Advances in Databases and Information Systems Search – changes our cognitive habits 1.We are increasingly handing off the job of remembering to search engines 2.When we expect information to be easily found again, we do not remember it well 3.Our original memory of facts is changing to a memory of ways to find the facts • Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 16 20th East-European Conference on Advances in Databases and Information Systems 17 State of the art in Metric Searching technology book Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics) Hanan Samet Foundation of Multidimensional and Metric Data Structures Morgan Kaufmann, 2006 P. Zezula, G. Amato, V. Dohnal, and M. Batko Similarity Search: The Metric Space Approach Springer, 2005 Teaching material: http://www.nmis.isti.cnr.it/amato/similarity-search-book/ Pragu, August 2016 Similarity Search Conferences • Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 18 E:\sisap-header.png Tokyo SISAP 2016 9th SISAP 2016, October 24-26, Tokyo, Japan The MUFIN Approach •MUFIN: MUlti-Feature Indexing Network SEARCH infrastructure Scalability P2P structure Extensibility metric space Independence Infrastructure as a service Pragu, August 2016 19 20th East-European Conference on Advances in Databases and Information Systems Extensibility: Metric Abstraction of Similarity •Metric space: M = (D,d) –D – domain –distance function d(x,y) •"x,y,z Î D •d(x,y) > 0 - non-negativity •d(x,y) = 0 Û x = y - identity •d(x,y) = d(y,x) - symmetry •d(x,y) ≤ d(x,z) + d(z,y) - triangle inequality Pragu, August 2016 20 20th East-European Conference on Advances in Databases and Information Systems Examples of Distance Functions •Lp Minkovski distance (for vectors) •L1 – city-block distance » •L2 – Euclidean distance • •L¥ – infinity » •Edit distance (for strings) •minimal number of insertions, deletions and substitutions •d(‘application’, ‘applet’) = 6 » •Jaccard’s coefficient (for sets A,B) • Pragu, August 2016 21 20th East-European Conference on Advances in Databases and Information Systems Examples of Distance Functions •Mahalanobis distance –for vectors with correlated dimensions •Hausdorff distance –for sets with elements related by another distance •Earth movers distance –primarily for histograms (sets of weighted features) •and many others – Pragu, August 2016 22 20th East-European Conference on Advances in Databases and Information Systems Similarity Search Problem •For X ÍD in metric space M, • pre-process X so that the similarity queries • are executed efficiently. • • –In metric space: no total ordering exists! • Pragu, August 2016 23 20th East-European Conference on Advances in Databases and Information Systems Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 24 Basic Partitioning Principles •Given a set X Í D in M=(D,d), basic partitioning principles have been defined: – –Ball partitioning –Generalized hyper-plane partitioning –Excluded middle partitioning – Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 25 Ball Partitioning •Inner set: { x Î X | d(p,x) ≤ dm } •Outer set: { x Î X | d(p,x) > dm } • p dm Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 26 Generalized Hyper-plane •{ x Î X | d(p1,x) ≤ d(p2,x) } •{ x Î X | d(p1,x) > d(p2,x) } • p2 p1 Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 27 Similarity Range Query • • • •range query –R(q,r) = { x Î X | d(q,x) ≤ r } • •… all museums up to 2km from my hotel … r q Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 28 Nearest Neighbor Query •the nearest neighbor query –NN(q) = x –x Î X, "y Î X, d(q,x) ≤ d(q,y) • •k-nearest neighbor query –k-NN(q,k) = A –A Í X, |A| = k –"x Î A, y Î X – A, d(q,x) ≤ d(q,y) • • • •… five closest museums to my hotel … q k=5 Scalability: Peer-to-Peer Indexing •Local search: Main memory structures •Native metric techniques: GHT*, VPT* •Transformation techniques: M-CAN, M-Chord system-schema Pragu, August 2016 29 20th East-European Conference on Advances in Databases and Information Systems [USEMAP] Infrastructure Independence: MESSIF Metric Similarity Search Implementation Framework Metric space (D,d) Operations Storage Centralized index structures Distributed index structures Communication Net Vectors • Lp and quadratic form Strings • (weighted) edit and protein sequence Insert, delete, range query, k-NN query, Incremental k-NN Volatile memory Persistent memory Performance statistics Pragu, August 2016 30 20th East-European Conference on Advances in Databases and Information Systems MUFIN demos •http://disa.fi.muni.cz/imgsearch/similar •http://www.pixmac.com/ •http://disa.fi.muni.cz/twenga/ •http://disa.fi.muni.cz/fingerprints/ •http://disa.fi.muni.cz/subseq/ •http://disa.fi.muni.cz/FaceMatch/ •http://disa.fi.muni.cz/annotation/ •http://disa.fi.muni.cz/motion-match/ •http://disa.fi.muni.cz/profimedia-neural_network-20M/ • Pragu, August 2016 31 20th East-European Conference on Advances in Databases and Information Systems http://disa.fi.muni.cz/FaceMatch/image/215920 http://disa.fi.muni.cz/FaceMatch/image/215920 • • • • • • • • • • • • •Preprocessing •Retrieval http://disa.fi.muni.cz/FaceMatch/image/215920 Face detection with several technologies Merge of detected faces •<13.9, 9.5, -6.0, 712.1, …> <17.9, 12.1, -9.1, 692.0, …> •<8.8, 7.7, -3.5, 570.8, …> • •<14.4, 8.2, -8.4, 704.0, …> <10.1, 5.8, 40.6, 99.6, …> •<5.4, 1.2, -60.4, 88.0, …> <45.1, 64.8, 90.6, 78.6, …> Face description with several technologies Similarity Search in Collections of Faces http://upload.wikimedia.org/wikipedia/commons/8/88/Cameron_Diaz_WE_2012_Shankbone_3.JPG • •<13.9, 9.5, -6.0, 712.1, …> <10.6, 78.9, -45.6, 101.3, …> •0.12 •0.17 •0.18 •0.23 Fused features DB Features indexing by one technique > Face Detection Features Extraction •Candidates filtering Index Fused features saving •Candidate faces •Query image Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 32 Fused Face Detection and Face Matching •Fused face detection: •Faces detected by more technologies are taken into account •Showcase: 3 technologies, compliance of at least two: • •Fused face matching: •Characteristic features from more technologies are available for each face •Similarity of two faces evaluated by each technology is normalized into interval [0, 1] •Normalized value expresses a probability that faces belong to the same person •Highest probability is used to determine the similarity of faces Software name OpenCV Luxand Verilook Compliance of at least 2 Recall / precision (%) 55 / 89 64 / 83 73 / 83 64 / 96 Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 33 Face Matching Results, Relevance Feedback • •User may improve results by marking correctly found faces in several iterations: 00788_941205_qr00.jpg 00816_960530_rb00.jpg 00825_940307_fa00.jpg 00825_940307_fb00.jpg 00825_940307_hl00.jpg 00825_940307_pl00.jpg 00846_940307_hr_a00.jpg 00717_960530_fb00.jpg • •query 00788_941205_qr00.jpg 00816_960530_rb00.jpg 00788_941205_qr00.jpg 00825_940307_fa00.jpg 00825_940307_fb00.jpg 00825_940307_hl00.jpg 00825_940307_hr00.jpg 00825_940307_pl00.jpg 00846_940307_hr_a00.jpg 00717_960530_fb00.jpg • •query 00825_940307_hl00.jpg 00825_940307_pl00.jpg 00825_940307_hl00.jpg 00825_940307_hr00.jpg 00825_940307_hl00.jpg 00846_940307_hr_a00.jpg •2nd iteration •1st iteration Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems 34 Slide ‹#› Search-based Image Annotation § §We already have a strong tool – the similarity search §For any input image, we can retrieve visually similar images §Metadata of the similar images can be used to describe the original image § https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRSpxVqEaSo5HgnpQLjkP44gtENLw8eHm8jhsO6LnM98BS eIzNt §Keyword-based image retrieval §Popular and intuitive §Needs pictures with text metadata §Manual annotation is expensive ? Need for automatic image annotation Search-based image annotation Slide ‹#› Search-based annotation principles https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRSpxVqEaSo5HgnpQLjkP44gtENLw8eHm8jhsO6LnM98BS eIzNt •Annotated image collection •Content-based •image retrieval Similar annotated images •Yellow, bloom, pretty •Meadow, outdoors, dandelion •Mary’s garden, summer •Candidate •keyword •processing •Semantic resources Final candidate keywords with probabilities •Plant 0.3 •Flower 0.3 •Garden 0.15 •Sun 0.05 •Human 0.1 •Park 0.1 d = 0.2 d = 0.6 d = 0.5 ? Slide ‹#› Content-based retrieval for annotations §What we need: § §Large collection of reliably annotated images: Profiset §20 million general-purpose photos from the Profimedia photostock company §Descriptive keywords for each photo provided by authors who want to sell the pictures → rich and reliable annotations §Efficient and effective search: DeCAF descriptors and PPP-codes §DeCAF: 4096-dimensional vector obtained from the last layer of a neural network image classifier §PPP-codes: effective permutation-based metric space indexing method § § § § § § § § Profiset keywords: botany, close, closeup, color, daytime, detail, exterior, flower, germany, hepatica, horticulture, laughingstock, liverwort, lobed, mecklenburg, nature, nobilis, outdoor, outside, plant, pomerania, purple, round, western http://disa.fi.muni.cz/profimedia/imagesJpeg/0009008905 Slide ‹#› ConceptRank §Candidate keyword analysis inspired by Google PageRank §Uses semantic connections between candidate keywords to determine the probability of individual candidates §Main steps: §Construct a graph of candidate keywords related by WordNet semantic links §New candidates can be found during the WordNet exploration §Apply biased random walk with restarts to compute the score of each keyword §Keyword scores from the content-based search are included via the biased restart Similar annotated images •Yellow, bloom, pretty •Meadow, outdoors, dandelion •Mary’s garden, summer d = 0.2 d = 0.6 d = 0.5 • Slide ‹#› Example http://disa.fi.muni.cz/profimedia/imagesJpeg/0077128591 Candidate keywords after CBIR church, architecture, travel, europe, building, religion, germany, buildings, north, churches, christianity, america, religious, exterior, st, historic, world, tourism, united, usa, … 1.Retrieve 100 similar images from Profiset 2.Merge their keywords, compute frequencies 3.Build the semantic network using WordNet 4.Compute the ConceptRank 5.Apply postprocessing & return 20 most probable keywords ConceptRank scores building (2.53), structure (2.41), LANDSCAPE (2.10), BUILDINGS (1.87), OBJECT (1.84), NATURE (1.78), place_of_worship (1.75), church (1.74), Europe (1.68), religion (1.64), continent (1.51), … Final keywords building, structure, church, religion, continent, group, travel, island, sky, architecture, tower, person, belief, locations, chapel, christianity, tourism, regions, country, district Semantic network 4 relationships: hypernym (dog → animal), hyponym (animal → dog), meronym (leaf → tree), holonym (tree → leaf) 270 network nodes, 471 edges Slide ‹#› Annotations in use §Participation in the ImageCLEF 2014 Scalable Annotation Challenge §2nd place, mean average precision of annotation approx. 60 % § §Web demo & Mozilla addon §http://disa.fi.muni.cz/prototype-applications/image-annotation § § § § § § § § § § § • Similarity Search in Streams ▪ ▪Two basic approaches to explore data: ▪Store, pre-process and search later, database processing ▪Process (filter) continuously, stream processing ▪ ▪Examples of stream processing applications: ▪Surveillance camera and event detection ▪Mail stream and spam filter ▪Publish/subscribe applications 41 Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems Stream Processing Scenarios ▪Stream: potentially infinite sequence of data items (d1, d2, …) – tuples, images, frames, etc. ▪Basic scenarios: ▪Data items processed immediately, possible data item skipping → minimize delay - e.g., event detection ▪Process everything as fast as possible, delay possible to maximize throughput - our focus ▪Motivating examples with similarity searching ▪Image annotation – annotate a stream of images collected by a web crawler ▪Publish/subscribe applications – categorize a stream of documents by similarity searching 42 Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems Processing Streams of Query Objects ▪Typical large-scale similarity search approach: ▪partitioned data stored on a disk ▪partition reads from a disk form the bottleneck ▪Idea: similar queries need similar sets of partitions → save accesses ▪Buffer: memory used for reordering (clustering) queries ▪Cache: memory containing previously read data partitions 43 Disk Buffer Query Cache Query Stream Result Metric index Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems Experiment Results ▪100,000 processed 10-NN queries ▪DB: 1 mil. MPEG-7 descriptors ▪Buffer capacity: 8,000 queries ▪Cache size: 40,000 objects (4% of the DB) 44 ▪100,000 processed 10-NN queries ▪DB: 10 mil. MPEG-7 descriptors ▪Buffer capacity: 10,000 queries ▪Cache size: 90,000 objects (0.9% of the DB) Pragu, August 2016 20th East-European Conference on Advances in Databases and Information Systems Similarity Search in Motion Capture (Mocap) Data Digital representations of human motions, recorded by motion capturing devices for further use in a variety of applications. 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 45 Pragu, August 2016 What Is Mocap 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 46 •Digital representation is depicted by series of coordinates of body joints in space-time. •Complex multi-dimensional spatio-temporal data (3D space, 31 joints, 120 frames per second). •Visualized by simplified human skeleton (stick figure), coordinates of joints stored as float numbers. •1 minute of such motion data ≈ 669,600 float numbers. http://cg.cs.uni-bonn.de/aigaion2root/attachments/keyframeSynthesis_web.jpg • Pragu, August 2016 The Need for a Similarity Measure ◦Almost every application of Mocap data (analysis, searching, action recognition, detection, synthesis, clustering) requires a pair-wise action comparison based on similarity. ◦The challenge: •Develop a measure for content-based similarity ◦ comparison of Mocap data. 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 47 ? • Pragu, August 2016 Motion Similarity Problems The same action can be performed differently •by different actors, •in various styles, •in various speed, •or start at different body configurations. ◦ ◦Similarity of motions is application-dependent •e.g., general action recognition vs. ◦ person-identification •there is no universal similarity model 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 48 Pragu, August 2016 Comparing Similarity in Motions General Overview 1. DATA REPRESENTATION absolute coordinates, relative distances, joint rotation angles or velocities (quantization or dimensionality reduction might be applied) 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 49 Machine Learning • Convolutional Neural Networks, Boltzmann M. • Support Vector Machines Special structures • Motion and Action graphs • Temporal pyramids • Hidden Markov models Distance-Based functions • Dynamic Time Warping • k-NN + L2 2. WAYS OF COMPARISON + Pragu, August 2016 https://docs.google.com/document/d/130MFzYC-hVFOCZhe58H7kcKkYuDPJp5p-D9gvWE8eU0/edit DTW is not metric – but approximations exists for indexing Comparing Similarity in Motions Examples 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 50 Joint positions features + Euclidean Distance (Krüger 2010) Fisher Vector + SVM classifier (Evangelidis 2014) Time Series + Dynamic Time Warping (Müller 2009) Pragu, August 2016 https://docs.google.com/document/d/130MFzYC-hVFOCZhe58H7kcKkYuDPJp5p-D9gvWE8eU0/edit Our Approach – Motion Images Every single-frame joint configuration is normalized (by centering and rotating), then transformed into a RGB stripe image while fully preserving skeleton configuration. 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 51 Pragu, August 2016 We focus on body joints, not volumetric model nor skeleton bones Why normalziation is important Motion Images + Caffe 1) Effective transformation from (dynamic) motion capture data into (static) images. 2) Extract fixed-size feature vector using content-based image descriptors. 3) Index for fast and scalable search 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 52 1) Motion Motion Image (hop-one-leg) 2) •<0, 0, 4.56, 0, 7.88, …> Nearest neighbors 3) d = 13.17 d = 10.1 d = 14.2 d = 1.1 4096-dim Caffe vector Pragu, August 2016 Similarity model: Caffe + L2 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 53 Caffe Descriptors Extracted using Convolutional Neural Network trained on 1.3M photographs Network can be fine-tuned to the domain of motion images Output of 7th layer is a 4096-dimensional vector transform Motion Images extract 4096-dimensional features Caffe descriptors are fixed sized vectors extracted from motion images. They are compared for similarity by L2 distance function. Pragu, August 2016 Motion Images – Properties •Pattern recognition is a mature concept nowadays many highly accurate computer vision techniques might be employed. •The proposed similarity measure is robust and tolerant towards inferior data quality, execution speed and imprecise segmentation. • •Fixed-size feature vectors can be indexed in large scale ◦ evaluate a query in one year long Mocap data in less than a second • •Fixed-size feature vectors compress the original data 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 54 Pragu, August 2016 Sizes Compared • 5 seconds of mocap •3 x 31 x 120 x 5 = 57 600 floats •≈ 460 KB • 1 image 256 x 256 px in png format •≈ 5-10 KB •1 caffe descriptor •4096 floats ≈ 32 KB •4096 bits ≈ 1 KB • 1 mpeg7 descriptor •256 floats ≈ 2 KB • • 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 55 Pragu, August 2016 Laboratory of Data Intensive Systems and Applications disa.fi.muni.cz Pragu, August 2016 20TH EAST-EUROPEAN CONFERENCE ON ADVANCES IN DATABASES AND INFORMATION SYSTEMS 56