Searching in Sparse Spaces via Random-Projection Pivots

Background
● Sparse vector spaces
  ○ vectors v = (v1, ..., vM) of dimensionality M where only "a few" dimensions are non-zero
  ○ typically represented by a list of pairs (dim, value) for the non-zero dimensions: v = { (di, vi) }
● (Dis)similarity measures
  ○ a common feature of the majority of commonly used measures sim(u, v): they consider only those dimensions that are non-zero in both vectors u and v
    ■ e.g. all measures based on the inner product (dot product)

Example of Spaces: IR
● There are many such spaces widely used in Information Retrieval (IR)
  ○ t - a term from a dictionary of size M
  ○ d - document (sequence of terms); q - query (sequence of terms)
  ○ tf(t,d) - term frequency (number of occurrences of term t in document d)
  ○ df(t) - document frequency of term t (number of documents containing term t)
  ○ idf(t) = log(N / df(t)) - inverse document frequency (N = total number of documents) - a way to weight terms t in documents (and queries)
IR definitions & notation from: C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

Vector Space Model in IR
● Vector space model - documents and queries represented as sparse vectors
  ○ ...of tf-idf scores (and variants, see below)
● Similarity by cosine: cos(u, v) = (u · v) / (|u| · |v|)
  ○ dot product: u · v = u1·v1 + ... + uM·vM
  ○ Euclidean length: |u| = sqrt(u1² + ... + uM²)
● For normalized vectors: cos(u, v) = u · v

Variants of tf-idf Scoring
● So-called SMART notation ddd.qqq (e.g. lnc.ltc) encodes the weighting and normalization factors applied to documents and queries
● But in all variants, documents and queries remain sparse vectors compared by a (normalized) dot product

Probabilistic Ranking Models
● Standard in modern IR
● Theory behind: estimation of the probability that a document di is relevant to a query q (estimation of RSV - Retrieval Status Value)
● Let us begin with a simple scoring, RSV(d) = Σ_{t∈q} idf(t), and refine it with term-frequency and document-length normalization:
  RSV(d) = Σ_{t∈q} idf(t) · ((k1 + 1)·tf(t,d)) / (k1·((1 - b) + b·Ld/Lave) + tf(t,d)) · ((k3 + 1)·tf(t,q)) / (k3 + tf(t,q))
  ○ Ld and Lave are the length of document d and the average document length, resp.
  ○ k1, k3 and b are tuning parameters
● This is the famous BM25 weighting scheme
● The core idea is still a "dot product" in a sparse space

Objectives
● Efficient and accurate kNN retrieval in sparse spaces with (dis)similarity functions based on the dot product (non-zero dimensions)
● If the query is very short (1-3 terms), the best approach is an inverted file
  ○ for every term t, keep a posting list of (coef, docid) pairs for all documents containing term t
  ○ given query q, scan the posting lists of the query terms and calculate individual document scores
  ○ in fact, better evaluation algorithms are used: static/dynamic pruning, block-WAND, etc.
1. For longer queries (query expansion, query-by-document), other solutions might be more efficient
2. There are more complex sparse representations (+ similarities) for which the posting lists of the query terms might not contain all relevant docs
  ○ bridging the vocabulary mismatch: IBM Model 1 coefficients, word embeddings

From IR Scores to Distance Spaces
● u · v = cos(u, v) · |u| · |v|
● d(u, v) = 1 - cos(u, v) - not a metric (but gives the same ordering as cos(u, v))
● d(u, v) = arccos(cos(u, v)) - angular distance (a metric)
● But we don't need the triangle inequality for approximate search
● ...in fact, for some indexes, we don't need to explicitly express a distance
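To make the above concrete, here is a minimal Python sketch (not from the original slides; the function names such as sparse_dot and cosine_distance are illustrative) of the sparse dot product that touches only the dimensions non-zero in both vectors, and of the distance d(u, v) = 1 - cos(u, v) derived from it.

```python
import math

# A sparse vector as a dict {dimension_id: value}, i.e. the list of
# (dim, value) pairs for the non-zero dimensions only.
SparseVec = dict

def sparse_dot(u: SparseVec, v: SparseVec) -> float:
    """Dot product over the dimensions that are non-zero in BOTH vectors."""
    if len(u) > len(v):          # iterate over the shorter vector
        u, v = v, u
    return sum(val * v[dim] for dim, val in u.items() if dim in v)

def norm(u: SparseVec) -> float:
    """Euclidean length |u|."""
    return math.sqrt(sum(val * val for val in u.values()))

def cosine_distance(u: SparseVec, v: SparseVec) -> float:
    """d(u, v) = 1 - cos(u, v); same ordering as cosine similarity."""
    denom = norm(u) * norm(v)
    return 1.0 - (sparse_dot(u, v) / denom if denom > 0 else 0.0)

# Example: two tf-idf-like vectors sharing a single dimension.
u = {3: 0.5, 17: 1.2, 42: 0.7}
v = {17: 0.9, 99: 0.3}
print(cosine_distance(u, v))
```

For vectors normalized to unit length the distance reduces to 1 - u · v, which is exactly the "dot product in a sparse space" that the measures above rely on.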
Pivoting Techniques
● Pivoting techniques:
  ○ pre-select some pivots/anchors/reference objects (reference documents)
  ○ calculate the (dis)similarity between objects (documents/queries) and the pivots
  ○ index & search the data based on the data-pivot similarities
    ■ PP-Index, M-Index, MI-File, PPP-Codes, NAPP, etc.
● NAPP inverted index (Neighborhood APProximation):
  ○ approximate the position of object x by the set P(x) of its K closest pivots (without their order)
  ○ given query q, the candidate set consists of all objects x s.t. |P(x) ∩ P(q)| ≥ s (sharing at least s pivots)
  ○ build an inverted index: for each pivot p, keep a list of the objects x for which p ∈ P(x)
  ○ use well-known IR algorithm(s) to identify the candidate set & then refine
[E. S. Tellez, E. Chavez, and G. Navarro, "Succinct nearest neighbor search," Inf. Syst., 38(7), 2013]

Pivoting Techniques in Sparse Spaces
● Naidan, Boytsov & Nyberg applied several kNN approaches to different (non-metric) spaces
● All tested techniques failed on the dataset with tf-idf scores from 4.2M Wikipedia pages with cosine similarity
[B. Naidan, L. Boytsov, and E. Nyberg, "Permutation Search Methods are Efficient, Yet Faster Search is Possible," in Proceedings of VLDB 2015]
[Figure: recall of 10-NN vs. fraction of accessed objects using NAPP]

Basic Insight into the Problem
● In sparse spaces, two random objects do not share many non-zero dimensions
● ...and if they do, these are "very common dimensions" (terms with low idf)
● Analysis of the wiki-sparse dataset (4.2M documents with tf-idf scores):
  ○ dictionary size M (dimensions): 100,000; avg. number of non-zero dimensions: 156.0
  ○ avg. number of overlapping dimensions (terms non-zero in both vectors): 3.81
  ○ avg. number of overlapping dimensions after removing the 1% of terms with lowest idf: 0.95
  ○ analyze how many dimensions are shared between an object and all of its K closest pivots
  ○ analyze how much energy is shared between an object and its closest pivots
    ■ "how much of the vector" is actually used for indexing

The Idea
● Generate pivots so that they use more information from the data:
  1. select pivots as long documents
  2. merge (put together) random documents
    ○ the same, but filtering out the terms with lowest idf
  3. run k-means and then merge the documents within each cluster
  4. random dimensions!
    ○ each pivot = a given number of randomly chosen dimensions (terms from the dictionary)
      ■ each of these selected dimensions is set to 1 (and then the vector is normalized)
    ○ surprisingly good results

Already Done
● first analysis and experiments
● proposal of an efficient structure to calculate the similarity between an object and all pivots at once (a sketch is shown below)
  ○ keep a hash map over all dimensions
  ○ for each dimension, store a list of the pivots that contain this dimension
  ○ given an object vector, iterate over its non-zero dimensions in the map and accumulate the similarity to all pivots
● efficient implementation within Leo & Bileg's NMSLIB C++ library
● experiments and results used in [1], without any details about the pivot-selection technique
[1] L. Boytsov, D. Novak, Y. Malkov, and E. Nyberg, "Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search," in CIKM, 2016, pp. 1099–1108.

What's Next?
● Theory:
  ○ estimate the number of dimensions shared between a vector and a pivot / several pivots / its closest pivots
  ○ derive probability-based properties explaining why our approach works :-)
  ○ confirm these estimations experimentally
● Representations - find out how much information we lose when
  ○ we transform sparse vectors to full (our) pivot rankings + Spearman footrule (Kendall)
  ○ we transform to pivot ranking prefixes + modified Spearman (M-Index style)
  ○ we forget the order of the pivot ranking + Jaccard coefficient (NAPP style)
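The sketch below illustrates the random-dimension pivots from "The Idea" and the dimension-to-pivots structure from "Already Done". It is a minimal Python illustration, not the NMSLIB implementation; the function names and parameter values are assumptions made for the example. Pivots are built from a few randomly chosen dimensions set to 1 and normalized, an inverted map from dimensions to pivots accumulates an object's dot products to all pivots in a single pass over its non-zero dimensions, and the K most similar pivots form the NAPP-style set P(x).

```python
import heapq
import math
import random
from collections import defaultdict

def make_random_dim_pivots(num_pivots, dims_per_pivot, dict_size, rng=random):
    """Each pivot = a few randomly chosen dimensions set to 1, then L2-normalized."""
    pivots = []
    for _ in range(num_pivots):
        dims = rng.sample(range(dict_size), dims_per_pivot)
        w = 1.0 / math.sqrt(dims_per_pivot)   # normalization of the all-ones vector
        pivots.append({d: w for d in dims})
    return pivots

def build_dim_to_pivots(pivots):
    """Map: dimension -> list of (pivot_id, pivot_value) for pivots containing it."""
    index = defaultdict(list)
    for pid, pivot in enumerate(pivots):
        for dim, val in pivot.items():
            index[dim].append((pid, val))
    return index

def similarities_to_all_pivots(obj, dim_to_pivots, num_pivots):
    """Accumulate dot products to all pivots in one pass over the object's non-zero dims."""
    sims = [0.0] * num_pivots
    for dim, val in obj.items():
        for pid, pval in dim_to_pivots.get(dim, ()):
            sims[pid] += val * pval
    return sims

def closest_pivots(obj, dim_to_pivots, num_pivots, K):
    """NAPP-style P(x): ids of the K pivots most similar to the object (order dropped)."""
    sims = similarities_to_all_pivots(obj, dim_to_pivots, num_pivots)
    return set(heapq.nlargest(K, range(num_pivots), key=sims.__getitem__))

# Tiny usage example with made-up sizes.
pivots = make_random_dim_pivots(num_pivots=32, dims_per_pivot=50, dict_size=1000)
dim_to_pivots = build_dim_to_pivots(pivots)
doc = {7: 0.4, 123: 1.1, 500: 0.8}   # sparse tf-idf-like vector
print(closest_pivots(doc, dim_to_pivots, len(pivots), K=4))
```

Given P(q) for a query, the NAPP-style candidate set would then be the objects that share at least s pivot ids with it, retrieved from an inverted index keyed by pivot id and refined with the original similarity.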
What's Next?
● Effectiveness/efficiency experiments:
  ○ 3 (or 4) datasets, 2-3 representations (similarity spaces), queries of different lengths
● Baselines to compare with:
  ○ Lucene-like inverted index (with different optimizations such as WAND, block-WAND, etc.)
  ○ FALCONN: the current best LSH technique for cosine distance [Andoni, Indyk et al., Practical and Optimal LSH for Angular Distance, NIPS 2015]
  ○ k-NN graph (proximity graph): namely, the SW-graph
● Advantage of our approach: it "works" for any (dis)similarity
  ○ indexing is based directly on the similarity function
● Write and submit a paper

Thanks for your attention. Questions, comments, please...