SIMILARITY SEARCH: The Metric Space Approach
Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko
Part II, Chapter 5

Table of Contents
- Part I: Metric searching in a nutshell
  - Foundations of metric space searching
  - Survey of existing approaches
- Part II: Metric searching in large collections
  - Centralized index structures
  - Approximate similarity search
  - Parallel and distributed indexes

Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Parallel Computing
- Parallel system
  - Multiple independent processing units
  - Multiple independent storage places
  - Shared dedicated communication media
  - Shared data
- Example
  - Processors (CPUs) share operating memory (RAM) and use a shared internal bus for communicating with the disks

Parallel Index Structures
- Exploit the parallel computing paradigm
- Speed up object retrieval
  - Parallel evaluations: using multiple processors at the same time
  - Parallel data access: several independent storage units
- Improve response times: both CPU and I/O costs

Parallel Search Measures
- Quantify the degree of the parallel improvement
- Speedup
  - Elapsed time of a fixed job run on
    - a small system (ST)
    - a big system (BT)
  - speedup = ST / BT
  - Linear speedup: an n-times bigger system yields a speedup of n

Parallel Search Measures (cont.)
- Scaleup
  - Elapsed time of
    - a small problem run on a small system (STSP)
    - a big problem run on a big system (BTBP)
  - scaleup = STSP / BTBP
  - Linear scaleup (scaleup = 1): an n-times bigger problem on an n-times bigger system is evaluated in the same time as needed by the original system to process the original problem
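To make the two measures concrete, here is a minimal sketch in Python; the function names and the example timings are illustrative, not from the book.

```python
def speedup(st: float, bt: float) -> float:
    """speedup = ST / BT: the same fixed job timed on the small and the big system."""
    return st / bt

def scaleup(stsp: float, btbp: float) -> float:
    """scaleup = STSP / BTBP: small problem on a small system vs. an n-times
    bigger problem on an n-times bigger system; 1.0 means linear scaleup."""
    return stsp / btbp

# Illustrative timings: a job takes 80 s on 1 CPU and 20 s on 4 CPUs,
# so speedup = 4.0 (linear). A 4-times bigger problem on the 4-CPU system
# also takes 80 s, so scaleup = 1.0 (linear).
print(speedup(80.0, 20.0))   # 4.0
print(scaleup(80.0, 80.0))   # 1.0
```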
Distributed Computing
- Parallel computing on several computers
  - Independent processing and storage units: the CPUs and disks of all the participating computers
  - Connected by a network
    - high speed, large scale
    - Internet, corporate LANs, etc.
- Practically unlimited resources

Distributed Index Structures
- Data stored on multiple computers
  - Navigation (routing) algorithms: solve queries and data updates
  - Network communication: efficiency and scalability
- Two representatives:
  - Scalable and Distributed Data Structures (SDDS)
  - Peer-to-peer networks

Scalable and Distributed Data Structures
- Client/server paradigm
  - Clients pose queries and update data
  - Servers solve queries and store data
- Navigation algorithms
  - Use local information only
  - Can be imprecise: an image adjustment technique updates the local information

Distributed Index Example
[Figure: clients send search and update requests over the network to servers, each storing a part of the data]

SDDS Properties
- Scalability: data migrate to new network nodes gracefully, and only when the nodes already used are sufficiently loaded
- No hotspot: there is no master site that must be accessed for resolving the addresses of searched objects (e.g., a centralized directory)
- Independence: the file access and maintenance primitives (search, insert, node split, etc.) never require atomic updates on multiple nodes

Peer-to-Peer Data Networks
- Inherit the basic principles of SDDS
- Peers are equal in functionality: computers participating in the P2P network have the functionality of both the client and the server
- Additional high-availability restrictions: fault tolerance, redundancy

Peer-to-Peer Index Example
[Figure: peers, each possibly storing data, connected by the network; every peer can both issue and answer queries]

Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Processing M-trees with Parallel Resources
- Parallel extension of the basic M-tree
  - Decreases both the I/O and the CPU costs
  - Range queries
  - k-NN queries
- Restrictions on parallelism
  - Hierarchical dependencies between tree nodes
  - The priority queue during the k-NN search

M-tree: Internal Node (reminder)
- An internal node consists of an entry for each subtree
- Each entry consists of:
  - Pivot: p
  - Covering radius of the subtree: r^c
  - Distance from p to the parent pivot p^p: d(p, p^p)
  - Pointer to the subtree: ptr
- All objects in the subtree ptr are within the distance r^c from p.

M-tree: Leaf Node (reminder)
- A leaf node contains data entries
- Each entry is a pair:
  - Object (or its identifier): o
  - Distance between o and its parent pivot: d(o, o^p)
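The two node layouts can be summarized as plain records; a minimal sketch, assuming Python class names of our own choosing (the fields are exactly the ones listed above):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class InternalEntry:
    pivot: object            # p
    covering_radius: float   # r^c: every object below lies within r^c of p
    dist_to_parent: float    # d(p, p^p); unused for entries of the root
    subtree: "Node"          # ptr

@dataclass
class LeafEntry:
    obj: object              # o, the object or its identifier
    dist_to_parent: float    # d(o, o^p)

@dataclass
class Node:
    entries: List[Union[InternalEntry, LeafEntry]]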
Parallel M-tree: Lowering CPU Costs
- Inner node parallel acceleration
  - A node at a given level cannot be accessed until all its ancestors have been processed
  - Up to m processors compute the distances to the pivots d(q, p_i) in parallel
- Leaf node parallel acceleration
  - Independent distance evaluations d(q, o_i) for all leaf objects
- k-NN query priority queue
  - Maintained by one dedicated CPU

Parallel M-tree: Lowering I/O Costs
- Nodes are accessed in a specific order
  - Determined by the particular similarity query
  - Fetching nodes into main memory (I/O)
- Parallel I/O for multiple disks
  - Distribute the nodes among the disks
  - Decluster to maximize parallel fetching
    - Choose the disk where to place a new node (originating from a split)
    - A disk with as few nodes with similar objects/regions as possible is a good candidate

Parallel M-tree: Declustering
- Global allocation declustering
  - Only the number of nodes stored on a disk is taken into account
    - round-robin strategy for storing a new node
    - random strategy
  - No data skew
- Proximity-based allocation declustering (see the sketch below)
  - The proximity of the nodes' regions determines the allocation
  - Choose the disk with the lowest sum of proximities between the new node's region and all the nodes already stored on that disk

Parallel M-tree: Efficiency
- Experimental evaluation
  - Good speedup and scaleup
  - The sequential components are not very restrictive
- Linear speedup on CPU costs: adding processors decreased the costs linearly
- Nearly constant scaleup
  - The response time stays practically the same with a five-times bigger dataset and five times more processors
  - Limited by the number of processors

Parallel M-tree: Object Declustering
- Declusters objects instead of nodes
  - The inner M-tree nodes remain the same
  - The leaf nodes contain pointers to objects
- Objects get spread among different disks
- Similar objects are stored on different disks
  - Objects accessed by a similarity query are maximally distributed among the disks
  - Maximum I/O parallelization
- While inserting o_N, evaluate the range query R(o_N, d(o_N, p))
  - Choose the disk with the minimum number of retrieved objects for the physical storage of o_N
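A minimal sketch of the proximity-based disk choice. The proximity function for ball regions is an illustrative assumption (a rough overlap measure), not the book's exact definition; the strategy itself (pick the disk minimizing the summed proximity to its stored nodes) follows the slide above.

```python
def ball_proximity(a, b, d):
    """Illustrative proximity of two ball regions (pivot, radius):
    how much the balls overlap; 0 if they are disjoint."""
    (pa, ra), (pb, rb) = a, b
    return max(0.0, ra + rb - d(pa, pb))

def choose_disk(new_region, disks, d):
    """disks: list of lists of regions already stored on each disk.
    Returns the index of the disk with the lowest summed proximity,
    i.e., the disk holding the fewest similar regions."""
    sums = [sum(ball_proximity(new_region, r, d) for r in regions)
            for regions in disks]
    return sums.index(min(sums))
```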
Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Distributed Similarity Search
- A metric space indexing technique
  - Generalized hyperplane partitioning
- Peer-to-peer paradigm
  - Self-organizing
  - Fully scalable
  - No centralized components
- The GHT* structure

GHT* Architecture
- Peers: computers connected by the network
  - Message passing paradigm: request and acknowledgment messages
  - Each peer has a unique network node identifier (NNID)
  - Peers issue queries, insert/update data, and process data and answer queries

GHT* Architecture (cont.)
- Buckets
  - Storage for data: metric space objects, with no knowledge about their internal structure
  - Limited space: splits and merges are possible
  - Held by peers, multiple buckets per peer
    - a peer may hold no bucket at all
    - each bucket is identified by a BID, unique within its peer

GHT* Architecture Schema
[Figure: a network connecting Peer 1 (no buckets), Peer 2 (two buckets), and Peer 3 (one bucket); the peers are addressed by NNID1-NNID3, their buckets by BIDs unique within each peer, and the peers exchange request and acknowledgment messages]

GHT* Architecture (cont.)
- The precise location of every object
  - Impossible to maintain on every peer
  - Navigation in the network is needed
- Address search tree (AST)
  - Present in every peer
  - May be imprecise
    - the navigation is then repeated in several steps
    - image adjustment keeps the local views up to date

GHT* Address Search Tree
- Based on the Generalized Hyperplane Tree
- A binary tree
- Inner nodes: pairs of pivots with serial numbers
- Leaf nodes: BID pointers to local buckets, or NNID pointers to other peers
[Figure: an AST with inner pivot pairs (p1, p2), (p3, p4), (p5, p6) and serial numbers; its leaves are BID1, BID2, BID3 and NNID2 (Peer 2); every peer stores such a tree]

GHT* Inserting Objects
- Peer 1 starts inserting an object o
  - Use the local AST, starting from the root
  - In every inner node with pivots p1 and p2:
    - take the left branch if d(p1, o) <= d(p2, o)
    - take the right branch otherwise, i.e., if d(p1, o) > d(p2, o)
  - Continue until a leaf node is reached

GHT* Inserting Objects (cont.)
- Peer 1 inserting the object o
  - If a BID pointer is found
    - store the object o in the pointed bucket
    - the bucket is local (stored on Peer 1)

GHT* Inserting Objects (cont.)
- Peer 1 inserting the object o
  - If an NNID pointer is found
    - the inserted object o is sent to the pointed peer (Peer 2)
    - the insertion resumes there

GHT* Binary Path
- Represents an AST traversal path
- A string of ones and zeros
  - '0' means the left branch
  - '1' means the right branch
- The serial numbers of the traversed inner nodes are attached
  - they detect obsolete parts of an AST
- Traversal example: BPATH 1[2] 0[3] (right at the root, whose serial number is 2, then left at a node with serial number 3)
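The insertion navigation and the BPATH fit in a few lines; a minimal sketch, assuming class and field names of our own choosing (the examples further below reuse these definitions):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class Leaf:
    bid: Optional[int] = None   # local bucket identifier, or
    nnid: Optional[int] = None  # identifier of the peer to forward to

@dataclass
class Inner:
    p1: object
    p2: object
    serial: int
    left: Union["Inner", Leaf]
    right: Union["Inner", Leaf]

def navigate(root, o, d) -> Tuple[Leaf, List[Tuple[int, int]]]:
    """Descend the AST for object o; return the reached leaf and the BPATH
    as a list of (branch bit, serial number) pairs."""
    path: List[Tuple[int, int]] = []
    node = root
    while isinstance(node, Inner):
        if d(node.p1, o) <= d(node.p2, o):   # closer to p1: left branch, bit 0
            path.append((0, node.serial))
            node = node.left
        else:                                # closer to p2: right branch, bit 1
            path.append((1, node.serial))
            node = node.right
    return node, path
```

If the reached leaf carries a BID, the object is stored locally; if it carries an NNID, the object is forwarded together with the BPATH.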
GHT* Binary Path (cont.)
- Example of a different path: BPATH 0[2] 1[2] (left at the root with serial number 2, then right at a node with serial number 2)

GHT* Storage Management
- The database grows as new data are inserted
- Buckets have limited capacity
- Bucket splits
  - Allocate a new bucket
  - Extend the routing information: choose new pivots
  - Move objects

AST Splitting
- The bucket capacity is reached
- Allocate a new bucket
  - either a new local bucket,
  - or a bucket at another peer
- Choose new pivots (p7, p8 in the figure)
- Adjust the AST
  - replace the bucket's leaf with an inner node holding the new pivots
  - add a leaf node for the new bucket (a BID or an NNID pointer)
- Move objects between the two buckets
[Figure: the leaf BID1 of the overfilled bucket becomes an inner node with pivots p7, p8 and serial number 1, pointing to BID1 and to the new bucket's BID/NNID]

Pivot Choosing Algorithm
- Pivots are pre-selected during insertion
  - Two objects are marked at any time
  - The marked objects become the pivots on a split
- A heuristic to maximize the distance between the pivots
  - Mark the first two inserted objects
  - Whenever a new object arrives
    - compute its distances from the currently marked objects
    - if one of these distances is greater than the distance between the marked objects, change the marked pair
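A minimal sketch of this marking heuristic, with a class name of our own choosing:

```python
class PivotMarker:
    """Keeps two marked objects per bucket; they become pivots on a split."""
    def __init__(self, d):
        self.d = d          # the metric function
        self.marked = []    # at most two currently marked objects

    def on_insert(self, o):
        if len(self.marked) < 2:        # mark the first two inserted objects
            self.marked.append(o)
            return
        p1, p2 = self.marked
        best = self.d(p1, p2)
        # If o is farther from one of the marked objects than the marked
        # objects are from each other, keep that farther pair instead.
        if self.d(p1, o) > best or self.d(p2, o) > best:
            if self.d(p1, o) >= self.d(p2, o):
                self.marked = [p1, o]
            else:
                self.marked = [p2, o]
```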
GHT* Range Search
- Peer 1 starts evaluating a range query R(q, r)
  - Use the local AST, starting from the root
  - In each inner node with pivots p1 and p2:
    - take the left branch if d(p1, q) - r <= d(p2, q) + r
    - take the right branch if d(p2, q) - r <= d(p1, q) + r
    - both branches can qualify
  - Continue until a leaf node is reached on every followed path

GHT* Range Search (cont.)
- Peer 1 evaluating the range query R(q, r)
  - For every BID pointer found
    - search the corresponding local bucket
    - retrieve all objects o in the bucket that satisfy d(q, o) <= r
    - any centralized similarity search method can be used inside the bucket

GHT* Range Search (cont.)
- Peer 1 evaluating the range query R(q, r)
  - For every NNID pointer found
    - continue the search at the corresponding peers

GHT* Range Search (cont.)
- For every NNID pointer found
  - Build the BPATH of the traversal
  - Forward the message to the pointed peer
- The destination peers consult their own ASTs
  - the BPATH avoids repeating the computations already made
- Wait until the results are gathered from all active peers
- Merge them with the results from the local buckets
[Figure: Peer 1 forwards the query to Peer 2 with BPATH 1[2] 1[3]]

GHT* Nearest Neighbor Search
- Based on the range search: estimate the query radius
- Evaluating a k-nearest-neighbors query k-NN(q)
  - Locate the bucket where q would be inserted
    - use the strategy for inserting an object
  - Start a range query with the radius r equal to the distance between q and the k-th nearest neighbor of q in this bucket
    - if the bucket contains fewer than k objects, estimate r using an optimistic or a pessimistic strategy
  - The result of the range query contains the k-NN result

GHT* k-NN Search Example
- Example: 5-NN(q)
  - Use the insert strategy in the local AST until a BID pointer is found
    - continue at another peer whenever an NNID pointer is found
  - Search the destination bucket

GHT* k-NN Search Example (cont.)
- Example: 5-NN(q)
  - Retrieve the five nearest neighbors of q in the local bucket
  - Set r to the distance of the fifth nearest neighbor found
  - Evaluate a distributed range search R(q, r)
    - the results include at least the five nearest neighbors from the local bucket
    - however, some additional objects closer to q can be found
  - Report the first five nearest objects of R(q, r)
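Putting the two phases together, a minimal sketch of this k-NN strategy; navigate() is the insertion navigation sketched earlier, while fetch_bucket(), estimate_radius() and range_search() are hypothetical helpers standing in for the bucket access, the optimistic/pessimistic radius estimate, and the distributed range search.

```python
def knn_query(root, q, k, d):
    """k-NN(q) evaluated as a range query with an estimated radius."""
    leaf, _ = navigate(root, q, d)          # bucket where q would be inserted
    bucket = fetch_bucket(leaf)             # hypothetical bucket accessor
    dists = sorted(d(q, o) for o in bucket)
    if len(dists) >= k:
        r = dists[k - 1]                    # distance to the k-th neighbor
    else:
        r = estimate_radius(bucket, q, k)   # hypothetical optimistic/pessimistic estimate
    # Distributed R(q, r); with the optimistic strategy the radius may be
    # too small, so further iterations with a larger r may be needed.
    result = range_search(root, q, r, d)    # hypothetical distributed search
    return sorted(result, key=lambda o: d(q, o))[:k]
```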
GHT* Updates and Deletions
- Updating an object
  - Delete the original object
  - Insert the updated version
- Deleting an object
  - Locate the bucket where the object is stored (using the insert navigation algorithm)
  - Remove the object from the bucket
  - The bucket occupation may become too low
    - merge the bucket with another one
    - update the corresponding nodes in the AST

GHT* Merging Buckets
- Remove a bucket
- Take its sibling in the AST
  - either a leaf node (a bucket) or an inner node
- Reinsert all remaining objects into the sibling
  - possibly into multiple buckets
- Remove the inner node
- Increase the node's serial number
[Figure: after the merge, the emptied leaf and its parent inner node are removed from the AST and the serial number is increased to 4]

AST: Image Adjustment
- The AST is modified on bucket splits and merges
- Only the peers affected by the change are aware of it (Peers 4 and 5 in the figure)
[Figure: after a split, a new inner node with pivots p7, p8 appears; only Peers 4 and 5 know about it]

AST: Image Adjustment (cont.)
- When another peer searches
  - it forwards the query, e.g., with BPATH 1[2] 1[3], to a peer which has a different (more recent) AST view

AST: Image Adjustment (cont.)
- The incomplete search is detected by a too-short BPATH
  - the receiver's AST continues below the node where the sender's BPATH ends
- The search evaluation resumes there
  - possibly forwarding the query to some other peers

AST: Image Adjustment (cont.)
- An image adjustment is then sent back
  - the initiating peer updates its local AST with the newer nodes

AST: Logarithmic Replication
- The full AST on every peer is space-consuming
  - many pivots must be replicated at each peer
- Only a limited AST is stored
  - all paths to the local buckets, and nothing more
- The hidden parts are replaced by the NNIDs of their leftmost peers
[Figure: a full AST with leaves BID1 and NNID2-NNID8, and the pruned AST of one peer keeping only the path to BID1, with NNID3 and NNID5 standing in for the cut-off subtrees]
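The detection can be sketched as follows, reusing the Inner/Leaf classes from the earlier navigation sketch; continue_range_search(), search_bucket() and send_image_adjustment() are hypothetical helpers, and the adjustment protocol is simplified to its essence (a too-short or obsolete BPATH reveals the sender's outdated view).

```python
def resume_search(root, bpath, q, r, d):
    """Replay the sender's BPATH over the local AST. If the local tree is
    deeper than the BPATH, or a recorded serial number is older than the
    local one, the sender's view is outdated: resume the search here and
    send an image adjustment back."""
    node, outdated = root, False
    for bit, serial in bpath:
        if not isinstance(node, Inner) or serial < node.serial:
            outdated = True           # obsolete step in the sender's path
            break
        node = node.right if bit else node.left
    if isinstance(node, Inner):
        outdated = True               # BPATH too short: newer nodes below
    if outdated:
        send_image_adjustment(node)   # hypothetical: ship the newer subtree back
        return continue_range_search(node, q, r, d)   # hypothetical helper
    return search_bucket(node, q, r, d)                # hypothetical helper
```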
AST: Logarithmic Replication (cont.)
- The result of logarithmic replication is a partial AST
- The hidden parts are replaced by the NNIDs of their leftmost peers
[Figure: the partial AST keeps the pivot pairs (p1, p2), (p3, p4), (p7, p8) on the path to the local bucket BID1; the remaining subtrees shrink to NNID3 and NNID5]

GHT* Joining the P2P Network
- A new node joining the network sends an "I'm here" message
  - Received by every active peer
  - The peers add the node to their lists of available peers
- When a node is needed by a split
  - Take one peer from the list and send it an activation request
  - The peer announces "I'm being used"
    - the other peers remove it from their lists
  - The peer is then "ready to serve"

GHT* Leaving the P2P Network
- Unexpected leaves are not handled
  - this would require replication or other fault-tolerance techniques
- Peers without storage
  - can leave without restrictions
- Peers storing some data
  - delete all stored data (all their buckets are merged)
  - reinsert the data back into the structure, without offering their own storage capacity
- Better leaving and fault-tolerance handling remains a research challenge
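The peer-availability protocol can be modeled with three message handlers; a minimal sketch whose message names mirror the slide (the network object and everything else is an assumed stand-in):

```python
class PeerNode:
    """Each peer keeps its own view of which peers are free to be used."""
    def __init__(self, network):
        self.network = network      # assumed messaging stand-in
        self.available = set()

    def on_im_here(self, node_id):
        """A new node broadcast "I'm here": remember it as available."""
        self.available.add(node_id)

    def on_im_being_used(self, node_id):
        """An activated peer broadcast "I'm being used": forget it."""
        self.available.discard(node_id)

    def activate_for_split(self):
        """A split needs a new peer: activate any node from the local list;
        the activated peer then broadcasts "I'm being used" itself."""
        node_id = self.available.pop()
        self.network.send(node_id, "activation request")
        return node_id
```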
Parallel and Distributed Indexes
1. Preliminaries
2. Processing M-trees with parallel resources
3. Scalable and distributed similarity search
4. Performance trials

Performance Trials
- Objective: show the performance of the distributed similarity search structure
- The same datasets as for the centralized structures
  - a direct comparison is possible
- The experiments show that the response times are nearly constant

Datasets
- The trials were performed on two datasets:
  - VEC: 45-dimensional vectors of image color features compared by a quadratic-form distance
  - STR: sentences of a Czech language corpus compared by the edit distance

Datasets: Distance Distribution
- Distribution of the distances within the datasets
  - VEC: a practically normal distance distribution
  - STR: a skewed distribution
[Figure: distance distribution histograms of VEC and STR]

Computing Infrastructure
- 300 Intel Pentium workstations
  - Linux operating system
  - available for use to university students
- Connected by a 100 Mbps network
  - access times of approximately 5 ms
- Memory-based buckets
  - limited capacity: up to 1,000 objects
- Basic setting: 100,000 objects on 25 peers

Performance Trials: Measures
- Distance computations
  - the number of all evaluations of the metric function, either in the AST or in the buckets
  - represent the CPU costs
    - these depend on the complexity of the metric function: one evaluation may take from hundreds of nanoseconds to seconds
- Accessed buckets
  - the number of buckets accessed during a query evaluation
  - represent the I/O costs

Performance Trials: Measures (cont.)
- Messages sent
  - transmitted between peers over the computer network
  - represent the communication costs
    - these depend on the size of the sent objects

Performance Trials: Remarks
- Response times are imprecise:
  - the computers were not dedicated
  - they depend on the actual load of the machines and of the underlying network, and on other influences
- The query objects follow the dataset distribution
- Every value is an average over 50 queries with
  - different query objects
  - the same selectivity (radius or number of nearest neighbors)

Performance Trials: Outline
- Performance of similarity queries
  - Global costs: CPU, I/O, and communication (similar to the centralized structures)
  - Parallel costs
  - Comparison of range and k-nearest-neighbors queries
- Data volume scalability
  - How the costs change while the size of the data grows
  - Intraquery parallelism
  - Interquery parallelism

Similarity Queries: Global Costs
- Changing the range query radius
- The result set size grows exponentially
- The number of buckets accessed (I/O costs) grows practically linearly
  - similar to centralized structures
- The number of peers accessed shows only a slight increase
  - more buckets are accessed per peer instead
[Graphs: accessed buckets vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- The number of buckets accessed grows very quickly as k increases
  - k-NN is very expensive, similarly to centralized structures
- The number of peers accessed follows the number of buckets
  - practically all buckets per peer are accessed for higher values of k
[Graphs: accessed buckets vs. k for VEC and STR]
Similarity Queries: Global Costs
- Changing the range query radius
- Distance computations (CPU costs), divided between the AST and the buckets
  - only a small percentage of the distance computations happens during the AST navigation
- The buckets use a linear scan
  - all objects must be accessed
  - no additional pruning technique is used
- Similar to centralized structures
[Graphs: distance computations vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- Distance computations
  - again, only a small percentage of them is needed during the AST navigation
- k-NN is very expensive also with respect to the CPU costs
[Graphs: distance computations vs. k for VEC and STR]

Similarity Queries: Global Costs
- Changing the range query radius
- The number of messages (communication costs), divided into requests and forwards
  - a forwarded message means misaddressing
  - only about 10% of the messages are forwarded, even though the logarithmic replication is used
- There is no communication in centralized structures
[Graphs: messages sent vs. query radius for VEC and STR]

Similarity Queries: Global Costs
- Changing k for k-NN queries (logarithmic scale)
- The number of messages
  - only a very small number of messages is forwarded
  - corresponds to the number of peers accessed: practically all peers are accessed for k greater than 100
  - slightly higher than for range queries
[Graphs: messages sent vs. k for VEC and STR]

Similarity Queries: Global Costs
- GHT* is comparable to centralized structures
  - no pruning techniques in the buckets: a slightly increased number of distance computations
  - buckets accessed on peers: not fixed-size disk blocks, but a fixed bucket capacity
- The trends are similar
  - the costs increase linearly
Similarity Queries: Parallel Costs
- Correspond to the actual response times
- More difficult to measure
  - take the maximum of the serial costs over all accessed peers
  - example: the parallel distance computations of a range query
    - count the distance computations at each accessed peer: at a peer, it is the sum of the costs of its accessed buckets
    - take the maximum of these values over the active peers
- k-NN has a serial phase of locating the first bucket
  - this first part must be added to the range query costs
  - additional serial iterations may be required if the optimistic/pessimistic strategy is used
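A minimal sketch of this parallel-cost measure for a range query: per peer, sum the distance computations of its accessed buckets; the parallel cost is the maximum over the peers, since the slowest peer determines the response. The data layout is illustrative.

```python
def parallel_distance_cost(costs_per_peer):
    """costs_per_peer: {peer_id: [distance computations per accessed bucket]}.
    Returns the parallel cost: the maximum per-peer sum."""
    return max(sum(buckets) for buckets in costs_per_peer.values())

# Example: peer A scans buckets costing 800 and 950 computations,
# peer B scans a single bucket costing 1,000 -> parallel cost 1,750.
print(parallel_distance_cost({"A": [800, 950], "B": [1000]}))
```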
Similarity Queries: Parallel Costs
- Changing the range query radius
- Parallel buckets accessed (I/O costs)
  - the maximal number of buckets accessed at a single peer
  - bounded by the capacity: a peer has at most five buckets
- Not affected by the query size
[Graphs: parallel bucket accesses vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Iterations
  - one additional optimistic-strategy iteration for k greater than 1,000
- Parallel bucket access costs
  - bounded by the capacity: practically all 5 buckets per peer are always accessed
  - the second iteration increases the costs
[Graphs: parallel bucket accesses vs. k for VEC and STR]

Similarity Queries: Parallel Costs
- Changing the range query radius
- Parallel distance computations (CPU costs)
  - the maximal number of distance computations at a single peer
    - the cost of the linear scans of the peer's accessed buckets
  - bounded by the capacity: a peer has at most five buckets of at most 1,000 objects each
- Good response times even for large radii
[Graphs: parallel distance computations vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Parallel distance computations
  - bounded by the capacity: at most 5,000 distance computations per peer
  - all objects held by a peer are evaluated
- The second iteration (k > 1,000) increases the cost
- Although the k-NN query is expensive, the CPU costs are bounded
[Graphs: parallel distance computations vs. k for VEC and STR]

Similarity Queries: Parallel Costs
- A measure for the messages sent (the communication costs)
  - during the query execution, a peer may send messages to several other peers
    - the cost equals that of sending only one message, because the peer sends them all at once
  - the serial part is thus the forwarding
- The number of peers contacted sequentially: the hop count

Similarity Queries: Parallel Costs
- Changing the range query radius
- Hop count (communication costs)
  - logarithmically proportional to the number of peers accessed
  - in practice, this cost is very hard to notice: the forwarding is executed before the local bucket scans
[Graphs: hop count vs. query radius for VEC and STR]

Similarity Queries: Parallel Costs
- Changing k for k-NN queries (logarithmic scale)
- Hop count
  - since only few messages are forwarded, the k-NN queries have practically the same costs as the range queries
  - a small number of additional hops during the second phase: approximately one additional hop is needed
[Graphs: hop count vs. k for VEC and STR]

Similarity Queries: Comparison
- k-NN versus range queries (logarithmic scale)
  - the range query radius is set to the distance of the k-th nearest object, i.e., the perfect estimate
- Total distance computations
  - the k-NN query is slightly more expensive than the range query
- Parallel distance computations
  - the differences between the first phase and the additional iteration(s) are clearly visible
[Graphs: k-NN versus range query costs for VEC and STR]

Similarity Queries: Parallel Costs
- Summary of the real GHT* costs
  - the real response of the indexing system
- GHT* exhibits
  - constant parallel CPU costs: the distance computations are bounded by the bucket capacity
  - constant parallel I/O costs: the number of buckets accessed is bounded by the peer capacity
  - logarithmic parallel communication costs, even with the logarithmic replication

Data Volume Scalability
- The dataset was gradually expanded to 1,000,000 objects
  - measurements after every increment of 2,000 objects
- Intraquery parallelism
  - the parallel response of a single query, measured in distance computations
  - the maximum of the costs incurred at the peers involved in the query
- Interquery parallelism
  - simplified as the ratio of the number of peers involved in a query to the total number of peers
  - the lower the ratio, the higher the chance for other queries to be executed in parallel

Data Volume Scalability
- Changing the dataset size, with two different query radii
- Intraquery parallelism
  - practically constant responses, even for the growing dataset
    - some irregularities were observed for small datasets
  - larger radii result in higher costs, though not by much
[Graphs: intraquery parallelism vs. dataset size for VEC and STR]

Data Volume Scalability
- Changing the dataset size, with two different values of k for k-NN and the corresponding range queries
- Intraquery parallelism
  - by analogy to the range queries, the responses are nearly constant
  - there is a small difference between the different values of k
[Graphs: intraquery parallelism of k-NN vs. dataset size for VEC and STR]

Data Volume Scalability
- Changing the dataset size, with two different query radii
- Interquery parallelism
  - as the size of the dataset increases, the interquery parallelism improves
  - better for the smaller radii: a smaller percentage of the peers is involved in a query
[Graphs: interquery parallelism vs. dataset size for VEC and STR]

Data Volume Scalability
- GHT* scalability for a single query
  - intraquery parallelism, covering both the AST navigation and the bucket search
  - remains practically constant for growing datasets
- GHT* scalability for multiple queries
  - interquery parallelism, simplified as the percentage of used peers
  - allows more queries to be executed at the same time as the dataset grows