SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko
Part II, Chapter 5

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Parallel Computing
- Parallel system
  - Multiple independent processing units
  - Multiple independent storage places
  - Shared dedicated communication media
  - Shared data
- Example
  - Processors (CPUs) share operating memory (RAM) and use a shared internal bus for communicating with the disks

Parallel Index Structures
- Exploit the parallel computing paradigm
- Speed up object retrieval
  - Parallel evaluations: using multiple processors at the same time
  - Parallel data access: several independent storage units
- Improve response times: both CPU and I/O costs

Parallel Search Measures
- Quantify the degree of the parallel improvement
- Speedup
  - Elapsed time of a fixed job run on
    - a small system (ST)
    - a big system (BT)
  - speedup = ST / BT
  - Linear speedup: an n-times bigger system yields a speedup of n

Parallel Search Measures (cont.)
- Scaleup
  - Elapsed time of
    - a small problem run on a small system (STSP)
    - a big problem run on a big system (BTBP)
  - scaleup = STSP / BTBP
  - Linear scaleup (scaleup = 1): an n-times bigger problem on an n-times bigger system is evaluated in the same time as needed by the original system to process the original problem
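To make the two measures concrete, here is a minimal sketch in Python; the function names and the example timings are illustrative, not from the book.

```python
def speedup(st: float, bt: float) -> float:
    """speedup = ST / BT: the same fixed job timed on the small and the big system."""
    return st / bt

def scaleup(stsp: float, btbp: float) -> float:
    """scaleup = STSP / BTBP: small problem on a small system vs. an n-times
    bigger problem on an n-times bigger system; 1.0 means linear scaleup."""
    return stsp / btbp

# Illustrative timings: a job takes 80 s on 1 CPU and 20 s on 4 CPUs,
# so speedup = 4.0 (linear). A 4-times bigger problem on the 4-CPU system
# also takes 80 s, so scaleup = 1.0 (linear).
print(speedup(80.0, 20.0))   # 4.0
print(scaleup(80.0, 80.0))   # 1.0
```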
Distributed Computing
- Parallel computing on several computers
  - Independent processing and storage units: the CPUs and disks of all the participating computers
  - Connected by a network
    - high speed, large scale
    - Internet, corporate LANs, etc.
- Practically unlimited resources

Distributed Index Structures
- Data stored on multiple computers
  - Navigation (routing) algorithms: solve queries and data updates
  - Network communication: efficiency and scalability
- Two representatives:
  - Scalable and Distributed Data Structures (SDDS)
  - Peer-to-peer networks

Scalable and Distributed Data Structures
- Client/server paradigm
  - Clients pose queries and update data
  - Servers solve queries and store data
- Navigation algorithms
  - Use local information only
  - Can be imprecise: an image adjustment technique updates the local information

Distributed Index Example
[Figure: clients send search and update requests over the network to servers, each storing a part of the data]

SDDS Properties
- Scalability: data migrate to new network nodes gracefully, and only when the nodes already used are sufficiently loaded
- No hotspot: there is no master site that must be accessed for resolving the addresses of searched objects (e.g., a centralized directory)
- Independence: the file access and maintenance primitives (search, insert, node split, etc.) never require atomic updates on multiple nodes

Peer-to-Peer Data Networks
- Inherit the basic principles of SDDS
- Peers are equal in functionality: computers participating in the P2P network have the functionality of both the client and the server
- Additional high-availability restrictions: fault tolerance, redundancy

Peer-to-Peer Index Example
[Figure: peers, each possibly storing data, connected by the network; every peer can both issue and answer queries]

Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Processing M-trees with Parallel Resources
- Parallel extension of the basic M-tree
  - Decreases both the I/O and the CPU costs
  - Range queries
  - k-NN queries
- Restrictions on parallelism
  - Hierarchical dependencies between tree nodes
  - The priority queue during the k-NN search

M-tree: Internal Node (reminder)
- An internal node consists of an entry for each subtree
- Each entry consists of:
  - Pivot: p
  - Covering radius of the subtree: r^c
  - Distance from p to the parent pivot p^p: d(p, p^p)
  - Pointer to the subtree: ptr
- All objects in the subtree ptr are within the distance r^c from p.

M-tree: Leaf Node (reminder)
- A leaf node contains data entries
- Each entry is a pair:
  - Object (or its identifier): o
  - Distance between o and its parent pivot: d(o, o^p)
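The two node layouts can be summarized as plain records; a minimal sketch, assuming Python class names of our own choosing (the fields are exactly the ones listed above):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class InternalEntry:
    pivot: object            # p
    covering_radius: float   # r^c: every object below lies within r^c of p
    dist_to_parent: float    # d(p, p^p); unused for entries of the root
    subtree: "Node"          # ptr

@dataclass
class LeafEntry:
    obj: object              # o, the object or its identifier
    dist_to_parent: float    # d(o, o^p)

@dataclass
class Node:
    entries: List[Union[InternalEntry, LeafEntry]]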
Parallel M-tree: Lowering CPU Costs
- Inner node parallel acceleration
  - A node at a given level cannot be accessed until all its ancestors have been processed
  - Up to m processors compute the distances to the pivots d(q, p_i) in parallel
- Leaf node parallel acceleration
  - Independent distance evaluations d(q, o_i) for all leaf objects
- k-NN query priority queue
  - Maintained by one dedicated CPU

Parallel M-tree: Lowering I/O Costs
- Nodes are accessed in a specific order
  - Determined by the particular similarity query
  - Fetching nodes into main memory (I/O)
- Parallel I/O for multiple disks
  - Distribute the nodes among the disks
  - Decluster to maximize parallel fetching
    - Choose the disk where to place a new node (originating from a split)
    - A disk with as few nodes with similar objects/regions as possible is a good candidate

Parallel M-tree: Declustering
- Global allocation declustering
  - Only the number of nodes stored on a disk is taken into account
    - round-robin strategy for storing a new node
    - random strategy
  - No data skew
- Proximity-based allocation declustering (see the sketch below)
  - The proximity of the nodes' regions determines the allocation
  - Choose the disk with the lowest sum of proximities between the new node's region and all the nodes already stored on that disk

Parallel M-tree: Efficiency
- Experimental evaluation
  - Good speedup and scaleup
  - The sequential components are not very restrictive
- Linear speedup on CPU costs: adding processors decreased the costs linearly
- Nearly constant scaleup
  - The response time stays practically the same with a five-times bigger dataset and five times more processors
  - Limited by the number of processors

Parallel M-tree: Object Declustering
- Declusters objects instead of nodes
  - The inner M-tree nodes remain the same
  - The leaf nodes contain pointers to objects
- Objects get spread among different disks
- Similar objects are stored on different disks
  - Objects accessed by a similarity query are maximally distributed among the disks
  - Maximum I/O parallelization
- While inserting o_N, evaluate the range query R(o_N, d(o_N, p))
  - Choose the disk with the minimum number of retrieved objects for the physical storage of o_N
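A minimal sketch of the proximity-based disk choice. The proximity function for ball regions is an illustrative assumption (a rough overlap measure), not the book's exact definition; the strategy itself (pick the disk minimizing the summed proximity to its stored nodes) follows the slide above.

```python
def ball_proximity(a, b, d):
    """Illustrative proximity of two ball regions (pivot, radius):
    how much the balls overlap; 0 if they are disjoint."""
    (pa, ra), (pb, rb) = a, b
    return max(0.0, ra + rb - d(pa, pb))

def choose_disk(new_region, disks, d):
    """disks: list of lists of regions already stored on each disk.
    Returns the index of the disk with the lowest summed proximity,
    i.e., the disk holding the fewest similar regions."""
    sums = [sum(ball_proximity(new_region, r, d) for r in regions)
            for regions in disks]
    return sums.index(min(sums))
```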
Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Distributed Similarity Search
- A metric space indexing technique
  - Generalized hyperplane partitioning
- Peer-to-peer paradigm
  - Self-organizing
  - Fully scalable
  - No centralized components
- The GHT* structure

GHT* Architecture
- Peers: computers connected by the network
  - Message passing paradigm: request and acknowledgment messages
  - Each peer has a unique network node identifier (NNID)
  - Peers issue queries, insert/update data, and process data and answer queries

GHT* Architecture (cont.)
- Buckets
  - Storage for data: metric space objects, with no knowledge about their internal structure
  - Limited space: splits and merges are possible
  - Held by peers, multiple buckets per peer
    - a peer may hold no bucket at all
    - each bucket is identified by a BID, unique within its peer

GHT* Architecture Schema
[Figure: a network connecting Peer 1 (no buckets), Peer 2 (two buckets), and Peer 3 (one bucket); the peers are addressed by NNID1-NNID3, their buckets by BIDs unique within each peer, and the peers exchange request and acknowledgment messages]

GHT* Architecture (cont.)
- The precise location of every object
  - Impossible to maintain on every peer
  - Navigation in the network is needed
- Address search tree (AST)
  - Present in every peer
  - May be imprecise
    - the navigation is then repeated in several steps
    - image adjustment keeps the local views up to date

GHT* Address Search Tree
- Based on the Generalized Hyperplane Tree
- A binary tree
- Inner nodes: pairs of pivots with serial numbers
- Leaf nodes: BID pointers to local buckets, or NNID pointers to other peers
[Figure: an AST with inner pivot pairs (p1, p2), (p3, p4), (p5, p6) and serial numbers; its leaves are BID1, BID2, BID3 and NNID2 (Peer 2); every peer stores such a tree]

GHT* Inserting Objects
- Peer 1 starts inserting an object o
  - Use the local AST, starting from the root
  - In every inner node with pivots p1 and p2:
    - take the left branch if d(p1, o) <= d(p2, o)
    - take the right branch otherwise, i.e., if d(p1, o) > d(p2, o)
  - Continue until a leaf node is reached

GHT* Inserting Objects (cont.)
- Peer 1 inserting the object o
  - If a BID pointer is found
    - store the object o in the pointed bucket
    - the bucket is local (stored on Peer 1)

GHT* Inserting Objects (cont.)
- Peer 1 inserting the object o
  - If an NNID pointer is found
    - the inserted object o is sent to the pointed peer (Peer 2)
    - the insertion resumes there

GHT* Binary Path
- Represents an AST traversal path
- A string of ones and zeros
  - '0' means the left branch
  - '1' means the right branch
- The serial numbers of the traversed inner nodes are attached
  - they detect obsolete parts of an AST
- Traversal example: BPATH 1[2] 0[3] (right at the root, whose serial number is 2, then left at a node with serial number 3)
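The insertion navigation and the BPATH fit in a few lines; a minimal sketch, assuming class and field names of our own choosing (the examples further below reuse these definitions):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class Leaf:
    bid: Optional[int] = None   # local bucket identifier, or
    nnid: Optional[int] = None  # identifier of the peer to forward to

@dataclass
class Inner:
    p1: object
    p2: object
    serial: int
    left: Union["Inner", Leaf]
    right: Union["Inner", Leaf]

def navigate(root, o, d) -> Tuple[Leaf, List[Tuple[int, int]]]:
    """Descend the AST for object o; return the reached leaf and the BPATH
    as a list of (branch bit, serial number) pairs."""
    path: List[Tuple[int, int]] = []
    node = root
    while isinstance(node, Inner):
        if d(node.p1, o) <= d(node.p2, o):   # closer to p1: left branch, bit 0
            path.append((0, node.serial))
            node = node.left
        else:                                # closer to p2: right branch, bit 1
            path.append((1, node.serial))
            node = node.right
    return node, path
```

If the reached leaf carries a BID, the object is stored locally; if it carries an NNID, the object is forwarded together with the BPATH.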
GHT* Binary Path (cont.)
- Example of a different path: BPATH 0[2] 1[2] (left at the root with serial number 2, then right at a node with serial number 2)

GHT* Storage Management
- The database grows as new data are inserted
- Buckets have limited capacity
- Bucket splits
  - Allocate a new bucket
  - Extend the routing information: choose new pivots
  - Move objects

AST Splitting
- The bucket capacity is reached
- Allocate a new bucket
  - either a new local bucket,
  - or a bucket at another peer
- Choose new pivots (p7, p8 in the figure)
- Adjust the AST
  - replace the bucket's leaf with an inner node holding the new pivots
  - add a leaf node for the new bucket (a BID or an NNID pointer)
- Move objects between the two buckets
[Figure: the leaf BID1 of the overfilled bucket becomes an inner node with pivots p7, p8 and serial number 1, pointing to BID1 and to the new bucket's BID/NNID]

Pivot Choosing Algorithm
- Pivots are pre-selected during insertion
  - Two objects are marked at any time
  - The marked objects become the pivots on a split
- A heuristic to maximize the distance between the pivots
  - Mark the first two inserted objects
  - Whenever a new object arrives
    - compute its distances from the currently marked objects
    - if one of these distances is greater than the distance between the marked objects, change the marked pair
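A minimal sketch of this marking heuristic, with a class name of our own choosing:

```python
class PivotMarker:
    """Keeps two marked objects per bucket; they become pivots on a split."""
    def __init__(self, d):
        self.d = d          # the metric function
        self.marked = []    # at most two currently marked objects

    def on_insert(self, o):
        if len(self.marked) < 2:        # mark the first two inserted objects
            self.marked.append(o)
            return
        p1, p2 = self.marked
        best = self.d(p1, p2)
        # If o is farther from one of the marked objects than the marked
        # objects are from each other, keep that farther pair instead.
        if self.d(p1, o) > best or self.d(p2, o) > best:
            if self.d(p1, o) >= self.d(p2, o):
                self.marked = [p1, o]
            else:
                self.marked = [p2, o]
```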
GHT* Range Search
- Peer 1 starts evaluating a range query R(q, r)
  - Use the local AST, starting from the root
  - In each inner node with pivots p1 and p2:
    - take the left branch if d(p1, q) - r <= d(p2, q) + r
    - take the right branch if d(p2, q) - r <= d(p1, q) + r
    - both branches can qualify
  - Continue until a leaf node is reached on every followed path

GHT* Range Search (cont.)
- Peer 1 evaluating the range query R(q, r)
  - For every BID pointer found
    - search the corresponding local bucket
    - retrieve all objects o in the bucket that satisfy d(q, o) <= r
    - any centralized similarity search method can be used inside the bucket

GHT* Range Search (cont.)
- Peer 1 evaluating the range query R(q, r)
  - For every NNID pointer found
    - continue the search at the corresponding peers

GHT* Range Search (cont.)
- For every NNID pointer found
  - Build the BPATH of the traversal
  - Forward the message to the pointed peer
- The destination peers consult their own ASTs
  - the BPATH avoids repeating the computations already made
- Wait until the results are gathered from all active peers
- Merge them with the results from the local buckets
[Figure: Peer 1 forwards the query to Peer 2 with BPATH 1[2] 1[3]]

GHT* Nearest Neighbor Search
- Based on the range search: estimate the query radius
- Evaluating a k-nearest-neighbors query k-NN(q)
  - Locate the bucket where q would be inserted
    - use the strategy for inserting an object
  - Start a range query with the radius r equal to the distance between q and the k-th nearest neighbor of q in this bucket
    - if the bucket contains fewer than k objects, estimate r using an optimistic or a pessimistic strategy
  - The result of the range query contains the k-NN result

GHT* k-NN Search Example
- Example: 5-NN(q)
  - Use the insert strategy in the local AST until a BID pointer is found
    - continue at another peer whenever an NNID pointer is found
  - Search the destination bucket

GHT* k-NN Search Example (cont.)
- Example: 5-NN(q)
  - Retrieve the five nearest neighbors of q in the local bucket
  - Set r to the distance of the fifth nearest neighbor found
  - Evaluate a distributed range search R(q, r)
    - the results include at least the five nearest neighbors from the local bucket
    - however, some additional objects closer to q can be found
  - Report the first five nearest objects of R(q, r)
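Putting the two phases together, a minimal sketch of this k-NN strategy; navigate() is the insertion navigation sketched earlier, while fetch_bucket(), estimate_radius() and range_search() are hypothetical helpers standing in for the bucket access, the optimistic/pessimistic radius estimate, and the distributed range search.

```python
def knn_query(root, q, k, d):
    """k-NN(q) evaluated as a range query with an estimated radius."""
    leaf, _ = navigate(root, q, d)          # bucket where q would be inserted
    bucket = fetch_bucket(leaf)             # hypothetical bucket accessor
    dists = sorted(d(q, o) for o in bucket)
    if len(dists) >= k:
        r = dists[k - 1]                    # distance to the k-th neighbor
    else:
        r = estimate_radius(bucket, q, k)   # hypothetical optimistic/pessimistic estimate
    # Distributed R(q, r); with the optimistic strategy the radius may be
    # too small, so further iterations with a larger r may be needed.
    result = range_search(root, q, r, d)    # hypothetical distributed search
    return sorted(result, key=lambda o: d(q, o))[:k]
```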
GHT* Updates and Deletions
- Updating an object
  - Delete the original object
  - Insert the updated version
- Deleting an object
  - Locate the bucket where the object is stored (using the insert navigation algorithm)
  - Remove the object from the bucket
  - The bucket occupation may become too low
    - merge the bucket with another one
    - update the corresponding nodes in the AST

GHT* Merging Buckets
- Remove a bucket
- Take its sibling in the AST
  - either a leaf node (a bucket) or an inner node
- Reinsert all remaining objects into the sibling
  - possibly into multiple buckets
- Remove the inner node
- Increase the node's serial number
[Figure: after the merge, the emptied leaf and its parent inner node are removed from the AST and the serial number is increased to 4]

AST: Image Adjustment
- The AST is modified on bucket splits and merges
- Only the peers affected by the change are aware of it (Peers 4 and 5 in the figure)
[Figure: after a split, a new inner node with pivots p7, p8 appears; only Peers 4 and 5 know about it]

AST: Image Adjustment (cont.)
- When another peer searches
  - it forwards the query, e.g., with BPATH 1[2] 1[3], to a peer which has a different (more recent) AST view

AST: Image Adjustment (cont.)
- The incomplete search is detected by a too-short BPATH
  - the receiver's AST continues below the node where the sender's BPATH ends
- The search evaluation resumes there
  - possibly forwarding the query to some other peers

AST: Image Adjustment (cont.)
- An image adjustment is then sent back
  - the initiating peer updates its local AST with the newer nodes

AST: Logarithmic Replication
- The full AST on every peer is space-consuming
  - many pivots must be replicated at each peer
- Only a limited AST is stored
  - all paths to the local buckets, and nothing more
- The hidden parts are replaced by the NNIDs of their leftmost peers
[Figure: a full AST with leaves BID1 and NNID2-NNID8, and the pruned AST of one peer keeping only the path to BID1, with NNID3 and NNID5 standing in for the cut-off subtrees]
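The detection can be sketched as follows, reusing the Inner/Leaf classes from the earlier navigation sketch; continue_range_search(), search_bucket() and send_image_adjustment() are hypothetical helpers, and the adjustment protocol is simplified to its essence (a too-short or obsolete BPATH reveals the sender's outdated view).

```python
def resume_search(root, bpath, q, r, d):
    """Replay the sender's BPATH over the local AST. If the local tree is
    deeper than the BPATH, or a recorded serial number is older than the
    local one, the sender's view is outdated: resume the search here and
    send an image adjustment back."""
    node, outdated = root, False
    for bit, serial in bpath:
        if not isinstance(node, Inner) or serial < node.serial:
            outdated = True           # obsolete step in the sender's path
            break
        node = node.right if bit else node.left
    if isinstance(node, Inner):
        outdated = True               # BPATH too short: newer nodes below
    if outdated:
        send_image_adjustment(node)   # hypothetical: ship the newer subtree back
        return continue_range_search(node, q, r, d)   # hypothetical helper
    return search_bucket(node, q, r, d)                # hypothetical helper
```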
AST: Logarithmic Replication (cont.)
- The result of logarithmic replication is a partial AST
- The hidden parts are replaced by the NNIDs of their leftmost peers
[Figure: the partial AST keeps the pivot pairs (p1, p2), (p3, p4), (p7, p8) on the path to the local bucket BID1; the remaining subtrees shrink to NNID3 and NNID5]

GHT* Joining the P2P Network
- A new node joining the network sends an "I'm here" message
  - Received by every active peer
  - The peers add the node to their lists of available peers
- When a node is needed by a split
  - Take one peer from the list and send it an activation request
  - The peer announces "I'm being used"
    - the other peers remove it from their lists
  - The peer is then "ready to serve"

GHT* Leaving the P2P Network
- Unexpected leaves are not handled
  - this would require replication or other fault-tolerance techniques
- Peers without storage
  - can leave without restrictions
- Peers storing some data
  - delete all stored data (all their buckets are merged)
  - reinsert the data back into the structure, without offering their own storage capacity
- Better leaving and fault-tolerance handling remains a research challenge
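The peer-availability protocol can be modeled with three message handlers; a minimal sketch whose message names mirror the slide (the network object and everything else is an assumed stand-in):

```python
class PeerNode:
    """Each peer keeps its own view of which peers are free to be used."""
    def __init__(self, network):
        self.network = network      # assumed messaging stand-in
        self.available = set()

    def on_im_here(self, node_id):
        """A new node broadcast "I'm here": remember it as available."""
        self.available.add(node_id)

    def on_im_being_used(self, node_id):
        """An activated peer broadcast "I'm being used": forget it."""
        self.available.discard(node_id)

    def activate_for_split(self):
        """A split needs a new peer: activate any node from the local list;
        the activated peer then broadcasts "I'm being used" itself."""
        node_id = self.available.pop()
        self.network.send(node_id, "activation request")
        return node_id
```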
Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Performance Trials
- Objective: show the performance of the distributed similarity search structure
- The same datasets as for the centralized structures
  - a direct comparison is possible
- The experiments show that the response times are nearly constant

Datasets
- The trials were performed on two datasets:
  - VEC: 45-dimensional vectors of image color features compared by a quadratic-form distance
  - STR: sentences of a Czech language corpus compared by the edit distance

Datasets: Distance Distribution
- Distribution of the distances within the datasets
  - VEC: a practically normal distance distribution
  - STR: a skewed distribution
[Figure: distance distribution histograms of VEC and STR]

Computing Infrastructure
- 300 Intel Pentium workstations
  - Linux operating system
  - available for use to university students
- Connected by a 100 Mbps network
  - access times of approximately 5 ms
- Memory-based buckets
  - limited capacity: up to 1,000 objects
- Basic setting: 100,000 objects on 25 peers

Performance Trials: Measures
- Distance computations
  - the number of all evaluations of the metric function, either in the AST or in the buckets
  - represent the CPU costs
    - these depend on the complexity of the metric function: one evaluation may take from hundreds of nanoseconds to seconds
- Accessed buckets
  - the number of buckets accessed during a query evaluation
  - represent the I/O costs

Performance Trials: Measures (cont.)
- Messages sent
  - transmitted between peers over the computer network
  - represent the communication costs
    - these depend on the size of the sent objects

Performance Trials: Remarks
- Response times are imprecise:
  - the computers were not dedicated
  - they depend on the actual load of the machines and of the underlying network, and on other influences
- The query objects follow the dataset distribution
- Every value is an average over 50 queries with
  - different query objects
  - the same selectivity (radius or number of nearest neighbors)

Performance Trials: Outline
- Performance of similarity queries
  - Global costs: CPU, I/O, and communication (similar to the centralized structures)
  - Parallel costs
  - Comparison of range and k-nearest-neighbors queries
- Data volume scalability
  - How the costs change while the size of the data grows
  - Intraquery parallelism
  - Interquery parallelism

Similarity Queries: Global Costs
- Changing the range query radius
- The result set size grows exponentially
- The number of buckets accessed (I/O costs) grows practically linearly
  - similar to centralized structures
- The number of peers accessed shows only a slight increase
  - more buckets are accessed per peer instead
[Graphs: accessed buckets vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- The number of buckets accessed grows very quickly as k increases
  - k-NN is very expensive, similarly to centralized structures
- The number of peers accessed follows the number of buckets
  - practically all buckets per peer are accessed for higher values of k
[Graphs: accessed buckets vs. k for VEC and STR]
Similarity Queries: Global Costs
- Changing the range query radius
- Distance computations (CPU costs), divided between the AST and the buckets
  - only a small percentage of the distance computations happens during the AST navigation
- The buckets use a linear scan
  - all objects must be accessed
  - no additional pruning technique is used
- Similar to centralized structures
[Graphs: distance computations vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- Distance computations
  - again, only a small percentage of them is needed during the AST navigation
- k-NN is very expensive also with respect to the CPU costs
[Graphs: distance computations vs. k for VEC and STR]

Similarity Queries: Global Costs
- Changing the range query radius
- The number of messages (communication costs), divided into requests and forwards
  - a forwarded message means misaddressing
  - only about 10% of the messages are forwarded, even though the logarithmic replication is used
- There is no communication in centralized structures
[Graphs: messages sent vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- The number of messages
  - only a very small number of messages is forwarded
  - corresponds to the number of peers accessed: practically all peers are accessed for k greater than 100
  - slightly higher than for range queries
[Graphs: messages sent vs. k for VEC and STR]

Similarity Queries: Global Costs
- GHT* is comparable to centralized structures
  - no pruning techniques in the buckets: a slightly increased number of distance computations
  - buckets accessed on peers: not fixed-size disk blocks, but a fixed bucket capacity
- The trends are similar
  - the costs increase linearly
Similarity Queries: Parallel Costs
- Correspond to the actual response times
- More difficult to measure
  - take the maximum of the serial costs over all accessed peers
  - example: the parallel distance computations of a range query
    - count the distance computations at each accessed peer: at a peer, it is the sum of the costs of its accessed buckets
    - take the maximum of these values over the active peers
- k-NN has a serial phase of locating the first bucket
  - this first part must be added to the range query costs
  - additional serial iterations may be required if the optimistic/pessimistic strategy is used
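A minimal sketch of this parallel-cost measure for a range query: per peer, sum the distance computations of its accessed buckets; the parallel cost is the maximum over the peers, since the slowest peer determines the response. The data layout is illustrative.

```python
def parallel_distance_cost(costs_per_peer):
    """costs_per_peer: {peer_id: [distance computations per accessed bucket]}.
    Returns the parallel cost: the maximum per-peer sum."""
    return max(sum(buckets) for buckets in costs_per_peer.values())

# Example: peer A scans buckets costing 800 and 950 computations,
# peer B scans a single bucket costing 1,000 -> parallel cost 1,750.
print(parallel_distance_cost({"A": [800, 950], "B": [1000]}))
```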
Similarity Queries: Parallel Costs
- Changing the range query radius
- Parallel buckets accessed (I/O costs)
  - the maximal number of buckets accessed at a single peer
  - bounded by the capacity: a peer has at most five buckets
- Not affected by the query size
[Graphs: parallel bucket accesses vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Iterations
  - one additional optimistic-strategy iteration for k greater than 1,000
- Parallel bucket access costs
  - bounded by the capacity: practically all 5 buckets per peer are always accessed
  - the second iteration increases the costs
[Graphs: parallel bucket accesses vs. k for VEC and STR]

Similarity Queries: Parallel Costs
- Changing the range query radius
- Parallel distance computations (CPU costs)
  - the maximal number of distance computations at a single peer
    - the cost of the linear scans of the peer's accessed buckets
  - bounded by the capacity: a peer has at most five buckets of at most 1,000 objects each
- Good response times even for large radii
[Graphs: parallel distance computations vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Parallel distance computations
  - bounded by the capacity: at most 5,000 distance computations per peer
  - all objects held by a peer are evaluated
- The second iteration (k > 1,000) increases the cost
- Although the k-NN query is expensive, the CPU costs are bounded
[Graphs: parallel distance computations vs. k for VEC and STR]

Similarity Queries: Parallel Costs
- A measure for the messages sent (the communication costs)
  - during the query execution, a peer may send messages to several other peers
    - the cost equals that of sending only one message, because the peer sends them all at once
  - the serial part is thus the forwarding
- The number of peers contacted sequentially: the hop count

Similarity Queries: Parallel Costs
- Changing the range query radius
- Hop count (communication costs)
  - logarithmically proportional to the number of peers accessed
  - in practice, this cost is very hard to notice: the forwarding is executed before the local bucket scans
[Graphs: hop count vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Hop count
  - since only few messages are forwarded, the k-NN queries have practically the same costs as the range queries
  - a small number of additional hops during the second phase: approximately one additional hop is needed
[Graphs: hop count vs. k for VEC and STR]

Similarity Queries: Comparison
- k-NN versus range queries (logarithmic scale)
  - the range query radius is set to the distance of the k-th nearest object, i.e., the perfect estimate
- Total distance computations
  - the k-NN query is slightly more expensive than the range query
- Parallel distance computations
  - the differences between the first phase and the additional iteration(s) are clearly visible
[Graphs: k-NN versus range query costs for VEC and STR]

Similarity Queries: Parallel Costs
- Summary of the real GHT* costs
  - the real response of the indexing system
- GHT* exhibits
  - constant parallel CPU costs: the distance computations are bounded by the bucket capacity
  - constant parallel I/O costs: the number of buckets accessed is bounded by the peer capacity
  - logarithmic parallel communication costs, even with the logarithmic replication

Data Volume Scalability
- The dataset was gradually expanded to 1,000,000 objects
  - measurements after every increment of 2,000 objects
- Intraquery parallelism
  - the parallel response of a single query, measured in distance computations
  - the maximum of the costs incurred at the peers involved in the query
- Interquery parallelism
  - simplified as the ratio of the number of peers involved in a query to the total number of peers
  - the lower the ratio, the higher the chance for other queries to be executed in parallel

Data Volume Scalability
- Changing the dataset size, with two different query radii
- Intraquery parallelism
  - practically constant responses, even for the growing dataset
    - some irregularities were observed for small datasets
  - larger radii result in higher costs, though not by much
[Graphs: intraquery parallelism vs. dataset size for VEC and STR]

Data Volume Scalability
- Changing the dataset size, with two different values of k for k-NN and the corresponding range queries
- Intraquery parallelism
  - by analogy to the range queries, the responses are nearly constant
  - there is a small difference between the different values of k
[Graphs: intraquery parallelism of k-NN vs. dataset size for VEC and STR]

Data Volume Scalability
- Changing the dataset size, with two different query radii
- Interquery parallelism
  - as the size of the dataset increases, the interquery parallelism improves
  - better for the smaller radii: a smaller percentage of the peers is involved in a query
[Graphs: interquery parallelism vs. dataset size for VEC and STR]

Data Volume Scalability
- GHT* scalability for a single query
  - intraquery parallelism, covering both the AST navigation and the bucket search
  - remains practically constant for growing datasets
- GHT* scalability for multiple queries
  - interquery parallelism, simplified as the percentage of used peers
  - allows more queries to be executed at the same time as the dataset grows