KD-Tree Implementation and Many
Dataset Customization
Erik Hasprunár
396122@mail.muni.cz
Motivation & Objectives
● Motivation:
– Similarity searching
● Computer vision(e.g. bag-of-words model)
● My (bachelor) work:
– Implementation of KD-tree
● to MESSIF library(Java)
– Indexing many datasets together
KD-Tree
● Space-partitioning data structure
● Binary tree
● Splits k-dimensional data each node based on
one chosen dimension
KD-Tree
● Example on 3-D data
● Number in bold is the chosen split dimension
Building KD-Tree
● Built recursively
● Choosing for every node:
– split dimension (dim)
– split threshold (t)
● Splits data info left and right subtree
– Lower data into left subtree (item[dim] < t)
– All others into right subtree (item[dim] >= t)
Split Dimension
● Originally, split dimensions were chosen
ciclically(1, 2, 3, 1...)
● In our implementation, we use VLFeat's
approach(VLFeat.org)
– Decision based on variance in all dimensions over
all data
– One of the few highest-variance dimensions is
chosen at random and set as split dimension
Split Threshold
● Can be mean or median
● Median approach is used by default
– balanced tree
● Data are then sorted by comparing the split
dimension values and are split into subtrees
Leaf Node
● Only 1 datum given to the next node → datum
is saved in the leaf node
– Last level of nodes
– After all leaf nodes are created, building of the tree
ends
Searching in KD-Tree
● k-NN query (k-nearest
neighbors) is
implemented
– Green dot = query
– Triangles = data
– Solid line circle
● k = 3
– Dashed line circle
● k = 5
k-NN search
● Searches recursively
● Priority queue of nodes is handled
● Lower-bounds are computed for each node in the
queue
– minimal possible-distance answer that can be found in the
node's subtree
● Search prioritizing based on the node's split dimension
– query[split dimension] < split threshold → left child node
is next, right child node is put into queue and vice versa
Leaf Node
● When the leaf node is reached, distance between the
query and the datum in the leaf is computed
● Distance is compared to the highest-distance answer
(found until now)
– If (distance > highest-answer distance) → vector is not
inserted
– If (distance < highest-answer distance) → vector is inserted
and the highest distance answer is thrown out
● After this, next node from priority queue is taken for
next searching etc.
Stop condition
● If (highest-distance of answers < lowest-bound
of nodes) → the search ends
● Stop condition is hardly achieved when the data
are of high dimensionality
– “Curse of dimensionality”
– Can be solved by approximation
Approximation
● As the distance computation is the most
costly operation, maximum number of these
can be preset
● Choosing the right number is a problem yet to
be solved(depends on how accurate the
answer is needed, number of data,
dimensionality etc.)
Randomized KD-Forest
● Same principle as KD-tree, but more trees
● Build:
– As the split dimensions are chosen at random(partly), all
trees are built differently
● Search:
– One priority queue of nodes handles all trees(“jumping
between trees”)
● Randomized forests have been found to be
effective at approximate searches
Many Dataset Experiment
● Experiment motivation:
– Build a tree with more independent datasets
● Datasets share the same data space
– Searching for k answers in every dataset (k *
number of datasets = number of all answers)
Many Dataset Customization
● Build:
– Same as regular KD-tree
+ Dataset id is remembered for each data
● Search:
– Same as regular KD-tree
+ Answers for every dataset are separate
+ Stop condition is calculated based on the highestanswer
distance from all datasets
Testing System Specifications
● Processor: AMD A4-3310mx
● OS: Win7 64-bit
● RAM: 4GB(3,48GB)
● IDE:
– Netbeans 8.0.1(Java)
– Microsoft Visual C++ 2008 Express Edition(C)
Mine(Java) vs VLFeat(C)
Search time in ms
0 5 10 15 20 25 30
Java
C
– Number of objects: 10 000
– K: 100
– Dimensionality: 128
– Number of trees: 1
– 10 000 Distance Computations for both
Approximation Test
1000 2000 3000 4000 5000 6000 7000 8000 9000 1000
0
0,05
0,1
0,15
0,2
0,25
0,3
0,35
0,4
0,45
0,5
NMRR
– Number of objects: 10 000
– K: 200
– Dimensionality: 128
– Number of trees: 1
Many Dataset Results
● No real statistics (sorry)
● Distance computations on low-dimension data
looks fine
● Operation time looks bad for now (possible bad
implementation)