Implementation Notes for the Soft Cosine Measure

Vít Novotný
Masaryk University, Faculty of Informatics, Brno, Czech Republic
witiko@mail.muni.cz

ABSTRACT
The standard bag-of-words vector space model (vsm) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning but different terminology. To overcome this limitation, Sidorov et al. [14] proposed the Soft Cosine Measure (scm), which incorporates term similarity relations. Charlet and Damnati [2] showed that the scm is highly effective in question answering (qa) systems. However, the orthonormalization algorithm proposed by Sidorov et al. [14] has an impractical time complexity of $O(n^4)$, where $n$ is the size of the vocabulary. In this paper, we prove a tighter worst-case time complexity bound of $O(n^3)$. We also present an algorithm for computing the similarity between documents and show that its worst-case time complexity is $O(1)$ under realistic conditions. Lastly, we describe the implementation in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and Elasticsearch. Our results enable the deployment of the scm in real-world information retrieval systems.

KEYWORDS
Vector Space Model, computational complexity, similarity measure

ACM Reference Format:
Vít Novotný. 2018. Implementation Notes for the Soft Cosine Measure. In The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3269206.3269317

1 INTRODUCTION
The standard bag-of-words vector space model (vsm) [13] represents documents as real vectors. Documents are expressed in a basis where each basis vector corresponds to a single term, and each coordinate corresponds to the frequency of a term in a document. Consider the documents $d_1$ = “When Antony found Julius Caesar dead” and $d_2$ = “I did enact Julius Caesar: I was killed i’ the Capitol” represented in a basis $\{\alpha_i\}_{i=1}^{14}$ of $\mathbb{R}^{14}$, where the basis vectors correspond to the terms in the order of first appearance. Then the corresponding document vectors $v_1$ and $v_2$ would have the following coordinates in $\alpha$: $(v_1)_\alpha = [1\,1\,1\,1\,1\,1\,0\,0\,0\,0\,0\,0\,0\,0]^T$ and $(v_2)_\alpha = [0\,0\,0\,1\,1\,0\,2\,1\,1\,1\,1\,1\,1\,1]^T$. Assuming $\alpha$ is orthonormal, we can take the inner product of the $\ell^2$-normalized vectors $v_1$ and $v_2$ to measure the cosine of the angle (i.e. the cosine similarity) between the documents $d_1$ and $d_2$:
$$\langle v_1/\lVert v_1\rVert,\, v_2/\lVert v_2\rVert\rangle = \frac{((v_1)_\alpha)^T (v_2)_\alpha}{\sqrt{((v_1)_\alpha)^T (v_1)_\alpha}\,\sqrt{((v_2)_\alpha)^T (v_2)_\alpha}} \approx 0.23.$$
Intuitively, this underestimates the true similarity between $d_1$ and $d_2$. Assuming $\alpha$ is orthogonal but not orthonormal, and that the terms Julius and Caesar are twice as important as the other terms, we can construct a diagonal change-of-basis matrix $W = (w_{ij})$ from $\alpha$ to an orthonormal basis $\beta$, where $w_{ii}$ corresponds to the importance of a term $i$.
This brings us closer to the true similarity: $(v_1)_\beta = W(v_1)_\alpha = [1\,1\,1\,2\,2\,1\,0\,0\,0\,0\,0\,0\,0\,0]^T$, $(v_2)_\beta = W(v_2)_\alpha = [0\,0\,0\,2\,2\,0\,2\,1\,1\,1\,1\,1\,1\,1]^T$, and
$$\langle v_1/\lVert v_1\rVert,\, v_2/\lVert v_2\rVert\rangle = \frac{(W(v_1)_\alpha)^T\, W(v_2)_\alpha}{\sqrt{(W(v_1)_\alpha)^T W(v_1)_\alpha}\,\sqrt{(W(v_2)_\alpha)^T W(v_2)_\alpha}} \approx 0.53.$$
Since we assume that the bases $\alpha$ and $\beta$ are orthogonal, the terms dead and killed contribute nothing to the cosine similarity despite the clear synonymy, because $\langle\beta_{\mathrm{dead}}, \beta_{\mathrm{killed}}\rangle = 0$. In general, the vsm will underestimate the true similarity between documents that carry the same meaning but use different terminology.

In this paper, we further develop the soft vsm described by Sidorov et al. [14], which does not assume $\alpha$ is orthogonal and which achieved state-of-the-art results on the question answering (qa) task at SemEval 2017 [2]. In Section 2, we review the previous work incorporating term similarity into the vsm. In Section 3, we restate the definition of the soft vsm and present several computational complexity results. In Section 4, we describe the implementation in vector databases and inverted indices. We conclude in Section 5 by summarizing our results and suggesting future work.

2 RELATED WORK
Most works incorporating term similarity into the vsm published prior to Sidorov et al. [14] remain in an orthogonal coordinate system and instead propose novel document similarity measures. To name a few, Mikawa et al. [8] propose the extended cosine measure, which introduces a metric matrix $Q$ as a multiplicative factor in the cosine similarity formula. $Q$ is the solution of an optimization problem that maximizes the sum of extended cosine measures between each vector and the centroid of the vector's category. Conveniently, the metric matrix $Q$ can be used directly with the soft vsm, where it defines the inner product between basis vectors. Jimenez et al. [6] equip the multiset vsm with a soft cardinality operator that corresponds to the ordinary cardinality but takes term similarities into account.

The notion of generalizing the vsm to non-orthogonal coordinate systems was perhaps first explored by Sidorov et al. [14] in the context of entrance exam question answering, where the basis vectors did not correspond directly to terms, but to n-grams constructed by following paths in syntactic trees. The authors derive the inner product of two basis vectors from the edit distance between the corresponding n-grams. The soft cosine measure (scm) is their term for the formula that computes the cosine similarity between two vectors expressed in a non-orthogonal basis. They also present an algorithm that computes a change-of-basis matrix to an orthonormal basis in time $O(n^4)$. We present an $O(n^3)$ algorithm in this paper.

Charlet and Damnati [2] achieved state-of-the-art results at the qa task at SemEval 2017 [10] by training a document classifier on soft cosine measures between document passages. Unlike Sidorov et al. [14], Charlet and Damnati [2] already use basis vectors that correspond to terms rather than to n-grams. They derive the inner product of two basis vectors both from the edit distance between the corresponding terms and from the inner product of the corresponding word2vec term embeddings [9].

3 COMPUTATIONAL COMPLEXITY
In this section, we restate the definition of the soft vsm as it was described by Sidorov et al. [14]. We then prove a tighter worst-case time complexity bound for computing a change-of-basis matrix to an orthonormal basis. We also prove that under certain assumptions, the inner product is a linear-time operation.
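Before turning to the formal definition, we note that the two cosine similarities from the introductory example are easy to verify numerically. The following NumPy sketch is ours (it is not part of the experiments in this paper) and uses the term-frequency vectors and weights from Section 1:

```python
import numpy as np

# Term-frequency vectors of d1 and d2 in the basis alpha (14 terms,
# ordered by first appearance), as in Section 1.
v1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
v2 = np.array([0, 0, 0, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1], dtype=float)

def cosine(u, v):
    """Cosine similarity of two vectors expressed in an orthonormal basis."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(v1, v2), 2))          # 0.23, the plain cosine similarity

# The main diagonal of W: the terms Julius and Caesar (coordinates 4 and 5)
# are twice as important as the remaining terms.
w = np.ones(14)
w[[3, 4]] = 2.0
print(round(cosine(w * v1, w * v2), 2))  # 0.53, the weighted cosine similarity
```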
Definition 3.1. Let $\mathbb{R}^n$ be the real $n$-space over $\mathbb{R}$ equipped with the bilinear inner product $\langle\cdot,\cdot\rangle$. Let $\{\alpha_i\}_{i=1}^{n}$ be the basis of $\mathbb{R}^n$ in which we express our vectors. Let $W_\alpha = (w_{ij})$ be a diagonal change-of-basis matrix from $\alpha$ to a normalized basis $\{\beta_i\}_{i=1}^{n}$ of $\mathbb{R}^n$, i.e. $\langle\beta_i, \beta_j\rangle \in [-1, 1]$ and $\langle\beta_i, \beta_i\rangle = 1$. Let $S_\beta = (s_{ij})$ be the metric matrix of $\mathbb{R}^n$ w.r.t. $\beta$, i.e. $s_{ij} = \langle\beta_i, \beta_j\rangle$. Then $(\mathbb{R}^n, W_\alpha, S_\beta)$ is a soft vsm.

Theorem 3.2. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Then a change-of-basis matrix $E$ from the basis $\beta$ to an orthonormal basis of $\mathbb{R}^n$ can be computed in time $O(n^3)$.

Proof. By definition, $S = EE^T$ for any change-of-basis matrix $E$ from the basis $\beta$ to an orthonormal basis. Since $S$ contains the inner products of the linearly independent vectors $\beta$, it is Gramian and positive definite [5, p. 441]. The Gramianness of $S$ also implies its symmetry. Therefore, a lower triangular $E$ is uniquely determined by the Cholesky factorization of the symmetric positive-definite $S$, which we can compute in time $O(n^3)$ [15, p. 191]. □

Remark. See Table 1 for an experimental comparison. Although the vocabulary in our introductory example contains only $n = 14$ terms, $n$ is in the millions for real-world corpora such as the English Wikipedia. Therefore, we generally need to store the $n \times n$ matrix $S$ in a sparse format, so that it fits into main memory. Later, we will discuss how the density of $S$ can be reduced, but the Cholesky factor $E$ can also be arbitrarily dense and therefore expensive to store. Given a permutation matrix $P$, we can instead factorize $P^T S P$ into $FF^T$. Finding the permutation matrix $P$ that minimizes the density of the Cholesky factor $F$ is NP-hard [16], but heuristic strategies are known [3, 4]. Using the fact that $P^T = P^{-1}$, and basic facts about the transpose, we can derive $E = PF$ as follows: $S = PP^T S PP^T = PFF^T P^T = PF(PF)^T = EE^T$.

Table 1: The real time to compute a matrix $E$ from a dense matrix $S$, averaged over 100 iterations. We used two Intel Xeon E5-2650 v2 (20M cache, 2.60 GHz) processors to evaluate the $O(n^3)$ Cholesky factorization from NumPy 1.14.3 and the $O(n^4)$ iterated Gaussian elimination from LAPACK. For $n > 1000$, only sparse $S$ seem practical.

  n terms   Algorithm                Real computation time
  100       Cholesky factorization     0.0006 sec   (0.606 ms)
  100       Gaussian elimination       0.0529 sec  (52.893 ms)
  500       Cholesky factorization     0.0086 sec   (8.640 ms)
  500       Gaussian elimination      22.7361 sec  (22.736 sec)
  1000      Cholesky factorization     0.0304 sec  (30.378 ms)
  1000      Gaussian elimination     354.2746 sec   (5.905 min)

Lemma 3.3. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, y \in \mathbb{R}^n$. Then $\langle x, y\rangle = (W(x)_\alpha)^T S\, W(y)_\alpha$.

Proof. Let $E$ be the change-of-basis matrix from the basis $\beta$ to an orthonormal basis $\gamma$ of $\mathbb{R}^n$. Then:
$$\langle x, y\rangle = ((x)_\gamma)^T (y)_\gamma = (E(x)_\beta)^T E(y)_\beta = (EW(x)_\alpha)^T EW(y)_\alpha = \Bigl(\sum_{i=1}^{n} (\beta_i)_\gamma \cdot w_{ii} \cdot (x_i)_\alpha\Bigr)^{\!T} \sum_{j=1}^{n} (\beta_j)_\gamma \cdot w_{jj} \cdot (y_j)_\alpha$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ii} \cdot (x_i)_\alpha \cdot \langle\beta_i, \beta_j\rangle \cdot w_{jj} \cdot (y_j)_\alpha = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ii} \cdot (x_i)_\alpha \cdot s_{ij} \cdot w_{jj} \cdot (y_j)_\alpha = (W(x)_\alpha)^T S\, W(y)_\alpha. \;\square$$

Remark. From here, we can directly derive the cosine of the angle between $x$ and $y$ (i.e. what Sidorov et al. [14] call the scm) as follows:
$$\langle x/\lVert x\rVert,\, y/\lVert y\rVert\rangle = \frac{(W(x)_\alpha)^T S\, W(y)_\alpha}{\sqrt{(W(x)_\alpha)^T S\, W(x)_\alpha}\,\sqrt{(W(y)_\alpha)^T S\, W(y)_\alpha}}.$$
The scm is actually the starting point for Charlet and Damnati [2], who propose matrices $S$ that are not necessarily metric. If, like them, we are only interested in computing the scm, then we only require that the square roots remain real, i.e. that $x \neq 0 \implies (W(x)_\alpha)^T S\, W(x)_\alpha \geq 0$. For arbitrary $x \in \mathbb{R}^n$, this holds iff $S$ is positive semi-definite. However, since the coordinates $(x)_\alpha$ correspond to non-negative term frequencies, it is sufficient that $W$ and $S$ are non-negative as well. If we are only interested in computing the inner product, then $S$ can be arbitrary.
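As a minimal dense sketch of these operations (the function names are ours; a production implementation would use sparse matrices, as discussed in the remark for Theorem 3.2), the change-of-basis matrix $E$ of Theorem 3.2 and the scm formula above can be computed in NumPy as follows:

```python
import numpy as np

def change_of_basis(S):
    """A lower-triangular E with S = E @ E.T, as in Theorem 3.2.

    S must be symmetric positive definite; np.linalg.cholesky runs in O(n^3).
    """
    return np.linalg.cholesky(S)

def soft_cosine_measure(x_alpha, y_alpha, w, S):
    """The scm of two term-frequency vectors via the formula above.

    `w` holds the main diagonal of W; `S` is the term-similarity matrix.
    """
    wx, wy = w * x_alpha, w * y_alpha
    return (wx @ S @ wy) / np.sqrt((wx @ S @ wx) * (wy @ S @ wy))
```

With $S$ equal to the identity matrix, `soft_cosine_measure` reduces to the weighted cosine similarity from the introduction.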
Theorem 3.4. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm such that no column of $S$ contains more than $C$ non-zero elements, where $C$ is a constant. Let $x, y \in \mathbb{R}^n$ and let $m$ be the number of non-zero elements in $(x)_\beta$. Then $\langle x, y\rangle$ can be computed in time $O(m)$.

Proof. Assume that $(x)_\alpha$, $(y)_\alpha$, and $S$ are represented by data structures with constant-time column access and non-zero element traversal, e.g. compressed sparse column (csc) matrices. Further assume that $W$ is represented by an array containing the main diagonal of $W$. Then Algorithm 1 computes $(W(x)_\alpha)^T S\, W(y)_\alpha$ in time $O(m)$, which by Lemma 3.3 corresponds to $\langle x, y\rangle$. □

Algorithm 1 The inner product of x and y
1: r ← 0
2: for each i such that (xi)α is non-zero do        ▷ = m iterations
3:     for each j such that sij is non-zero do      ▷ ≤ C iterations
4:         r ← r + wii · (xi)α · sij · wjj · (yj)α
5: return r

Remark. Similarly, we can show that if a column of $S$ contains $C$ non-zero elements on average, then $\langle x, y\rangle$ has an average-case time complexity of $O(m)$. Note also that most information retrieval systems impose a limit on the length of a query document. Therefore, $m$ is usually bounded by a constant and $O(m) = O(1)$. Since we are usually interested in the inner products of all document pairs in two corpora (e.g. one containing queries and the other the actual documents), we can achieve significant speed improvements with vector processors by computing $(WX)^T S\, WY$, where $X$ and $Y$ are corpus matrices containing the coordinates of document vectors in the basis $\alpha$ as columns. To compute the scm, we first need to normalize the document vectors by performing an entrywise division of every column in $X$ by $\sqrt{\operatorname{diag}\bigl((WX)^T S\, WX\bigr)}$; the diagonal equals the vector of row-wise sums of $(WX)^T S \circ (WX)^T$, where $\circ$ denotes the entrywise product, so the full matrix product never needs to be materialized. $Y$ is normalized analogously.

There are several strategies for making no column of $S$ contain more than $C$ non-zero elements. If we do not require that $S$ is metric (e.g. because we only wish to compute the inner product, or the scm), a simple strategy is to start with an empty matrix and to insert the $C - 1$ largest elements and the diagonal element from every column of $S$. However, the resulting matrix will likely be asymmetric, which makes the inner product formula asymmetric as well. We can regain symmetry by always inserting an element $s_{ij}$ together with the element $s_{ji}$, and only if this does not make the column $j$ contain more than $C$ non-zero elements. This strategy is greedy, since later columns contain non-zero elements inserted by earlier columns. Our preliminary experiments suggest that processing columns that correspond to increasingly frequent terms performs best on the task of Charlet and Damnati [2]. Finally, by limiting the sum of all non-diagonal elements in a column to be less than one, we can make $S$ strictly diagonally dominant and therefore positive definite, which enables us to compute $E$ through the Cholesky factorization.
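The greedy symmetric strategy can be sketched as follows. This is our illustration rather than code from the original implementation: the function name and the use of scipy.sparse are our choices, and `column_order` would list the columns by increasing term frequency, as suggested above.

```python
import numpy as np
from scipy import sparse

def sparsify_term_similarities(S, C, column_order=None):
    """Greedily keep at most C non-zero elements per column of a dense,
    symmetric, non-negative term-similarity matrix S (a NumPy array).

    The diagonal is always kept; off-diagonal elements s_ij are inserted
    largest-first, together with s_ji, and only if neither column i nor
    column j would exceed C non-zero elements.
    """
    n = S.shape[0]
    result = sparse.lil_matrix((n, n))
    result.setdiag(S.diagonal())
    nonzeros = np.ones(n, dtype=int)  # per-column counts; diagonal included
    columns = column_order if column_order is not None else range(n)
    for i in columns:
        for j in np.argsort(S[:, i])[::-1]:  # candidates, largest first
            if nonzeros[i] >= C:
                break                        # column i is full
            if S[j, i] <= 0:
                break                        # no positive candidates remain
            if j == i or result[j, i] != 0:
                continue                     # diagonal or already inserted
            if nonzeros[j] >= C:
                continue                     # column j is full
            result[i, j] = result[j, i] = S[i, j]
            nonzeros[i] += 1
            nonzeros[j] += 1
    return result.tocsc()
```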
4 IMPLEMENTATION IN VECTOR DATABASES AND INVERTED INDICES
In this section, we present coordinate transformations for retrieving the nearest document vectors according to the inner product and the soft cosine measure from general-purpose vector databases such as Annoy or Faiss [7]. We also discuss the implementation in the inverted indices of text search engines such as Apache Lucene [1].

Remark. With a vector database, we can transform document vectors to an orthonormal basis $\gamma$. In the transformed coordinates, the dot product $((x)_\gamma)^T (y)_\gamma$ corresponds to the inner product $\langle x, y\rangle$, and the cosine similarity corresponds to the cosine of the angle $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ (i.e. the soft cosine measure). A vector database that supports nearest-neighbor search according to either the dot product or the cosine similarity will therefore retrieve vectors expressed in $\gamma$ according to either the inner product or the soft cosine measure. We can compute a change-of-basis matrix $E$ of order $n$ in time $O(n^3)$ by Theorem 3.2 and use it to transform every vector $x \in \mathbb{R}^n$ to $\gamma$ by computing $EW(x)_\alpha$. However, this approach requires that $S$ is symmetric positive-definite and that we recompute $E$ and reindex the vector database each time $S$ changes. We will now discuss transformations that do not require $E$ and for which a non-negative $S$ is sufficient, as discussed in the remark for Lemma 3.3.

Theorem 4.1. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, x', y \in \mathbb{R}^n$ such that $(x')_\beta = S^T (x)_\beta$. Then $\langle x, y\rangle = ((x')_\beta)^T (y)_\beta$.

Proof. $((x')_\beta)^T (y)_\beta = ((x)_\beta)^T S\, (y)_\beta = \langle x, y\rangle$ by Lemma 3.3. □

Remark. By transforming a query vector $x$ into $(x')_\beta$, we can retrieve documents according to the inner product in vector databases that only support nearest-neighbor search according to the dot product. Note that we do not introduce $S$ into $(y)_\beta$, which allows us to change $S$ without changing the documents in a vector database, and that $S$ can be arbitrary as discussed in the remark for Lemma 3.3.

Theorem 4.2. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, x', y, y', z, z' \in \mathbb{R}^n$ s.t. $x, y, z \neq 0$, $(x')_\beta = S^T (x)_\beta$, $(y')_\beta = \frac{(y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}}$, and $(z')_\beta = \frac{(z)_\beta}{\sqrt{((z)_\beta)^T S\, (z)_\beta}}$. Then $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle \leq \langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$ iff $((x')_\beta)^T (y')_\beta \leq ((x')_\beta)^T (z')_\beta$.

Proof. $((x')_\beta)^T (y')_\beta = \frac{((x)_\beta)^T S\, (y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}}$. From Lemma 3.3, this equals $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ except for the missing term $\sqrt{((x)_\beta)^T S\, (x)_\beta}$ in the divisor. The term is constant in both $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ and $\langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$, so the ordering is preserved. □

Remark. By transforming a query vector $x$ into $(x')_\beta$ and document vectors $y$ into $(y')_\beta$, we can retrieve documents according to the scm in vector databases that only support nearest-neighbor search according to the dot product.

Theorem 4.3. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm s.t. $S_\beta$ is non-negative. Let $x, y, y', z, z' \in \mathbb{R}^n$ and $x', y'', z'' \in \mathbb{R}^{n+1}$ s.t. $x \neq 0$, $y, z > 0$,
$$(x')_{\beta'} = \Bigl[\tfrac{(S^T(x)_\beta)^T}{\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}}\;\; 0\Bigr]^T,\quad (y')_\beta = \tfrac{(y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}},\quad (y'')_{\beta'} = \Bigl[((y')_\beta)^T \;\; \sqrt{1 - ((y')_\beta)^T (y')_\beta}\Bigr]^T,$$
$$(z')_\beta = \tfrac{(z)_\beta}{\sqrt{((z)_\beta)^T S\, (z)_\beta}},\quad\text{and}\quad (z'')_{\beta'} = \Bigl[((z')_\beta)^T \;\; \sqrt{1 - ((z')_\beta)^T (z')_\beta}\Bigr]^T,$$
where $\beta' = \beta \cup \{[0 \ldots 0\; 1]^T \in \mathbb{R}^{n+1}\}$. Then $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle \leq \langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$ iff
$$\frac{((x')_{\beta'})^T (y'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((y'')_{\beta'})^T (y'')_{\beta'}}} \leq \frac{((x')_{\beta'})^T (z'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((z'')_{\beta'})^T (z'')_{\beta'}}}.$$

Proof. $((x')_{\beta'})^T (x')_{\beta'} = 1$. Since $S$ is non-negative and $(y)_\beta > 0$, $((y)_\beta)^T S\, (y)_\beta \geq ((y)_\beta)^T (y)_\beta$ and therefore $((y')_\beta)^T (y')_\beta \leq 1$ and $((y'')_{\beta'})^T (y'')_{\beta'} = 1$ [11, sec. 4.2]. Therefore:
$$\frac{((x')_{\beta'})^T (y'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((y'')_{\beta'})^T (y'')_{\beta'}}} = ((x')_{\beta'})^T (y'')_{\beta'} = \frac{((x)_\beta)^T S\, (y)_\beta}{\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}\,\sqrt{((y)_\beta)^T S\, (y)_\beta}}.$$
From Lemma 3.3, this equals $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ except for the missing term $\sqrt{((x)_\beta)^T S\, (x)_\beta}$ and the extra term $\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}$ in the divisor. The terms are constant in both $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ and $\langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$, so the ordering is preserved. □

Remark. By transforming a query vector $x$ into $(x')_{\beta'}$ and document vectors $y$ into $(y'')_{\beta'}$, we can retrieve documents according to the scm in vector databases that only support nearest-neighbor search according to the cosine similarity.
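In NumPy, the transformations of Theorems 4.1–4.3 amount to a few lines. The following is a sketch under our naming: `x_beta` and `y_beta` stand for the coordinates $(x)_\beta = W(x)_\alpha$ and $(y)_\beta = W(y)_\alpha$, and $S$ is assumed non-negative where Theorem 4.3 requires it.

```python
import numpy as np

def expand_query(x_beta, S):
    """Theorem 4.1: (x')_beta = S^T (x)_beta, so that a plain dot product
    with an untransformed document vector equals the inner product <x, y>."""
    return S.T @ x_beta

def normalize_document(y_beta, S):
    """Theorem 4.2: (y')_beta = (y)_beta / sqrt((y)_beta^T S (y)_beta); dot
    products of expanded queries with these vectors rank documents by the scm."""
    return y_beta / np.sqrt(y_beta @ S @ y_beta)

def query_for_cosine_database(x_beta, S):
    """Theorem 4.3: a unit-length query with an appended zero coordinate, for
    databases that only support nearest-neighbor search by cosine similarity."""
    x_prime = S.T @ x_beta
    return np.append(x_prime / np.linalg.norm(x_prime), 0.0)

def document_for_cosine_database(y_beta, S):
    """Theorem 4.3: the appended coordinate pads the normalized document
    vector to unit length without affecting dot products with the queries."""
    y_prime = normalize_document(y_beta, S)
    residual = max(0.0, 1.0 - float(y_prime @ y_prime))  # guard against rounding
    return np.append(y_prime, np.sqrt(residual))
```

The same expansion $S^T(x)_\beta$ also drives the query expansion for inverted indices discussed below.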
Whereas most vector databases are designed for storing low-dimensional and dense vector coordinates, document vectors have the dimension $n$, which can be in the millions for real-world corpora such as the English Wikipedia. Apart from that, a document contains only a small fraction of the terms in the vocabulary, which makes the coordinates extremely sparse. Therefore, the coordinates need to be converted to a dense low-dimensional representation, e.g. using latent semantic analysis (lsa), before they are stored in a vector database or used for queries.

Unlike vector databases, inverted-index-based search engines are built around a data structure called the inverted index, which maps each term in our vocabulary to a list of documents (a posting) containing the term. Documents in a posting are sorted by a common criterion. The search engine tokenizes a text query into terms, retrieves the postings for the query terms, and then traverses the postings, computing the similarity between the query and the documents. We can directly replace the search engine's document similarity formula with the formula for the inner product from Lemma 3.3, or with the formula for the scm. After this straightforward change, the system will still only retrieve documents that have at least one term in common with the query. Therefore, we first need to expand the query vector $x$ by computing $((x)_\beta)^T S$ and retrieving the postings for all terms that correspond to the non-zero coordinates in the expanded vector. The expected number of these terms is $O(mC)$, where $m$ is the number of non-zero elements in $(x)_\alpha$ and $C$ is the maximum number of non-zero elements in any column of $S$. Assuming $m$ and $C$ are bounded by a constant, $O(mC) = O(1)$.

5 CONCLUSION AND FUTURE WORK
In this paper, we examined the soft vector space model (vsm) of Sidorov et al. [14]. We restated its definition, we proved a tighter time complexity bound of $O(n^3)$ for a related orthonormalization problem, and we showed how the inner product and the soft cosine measure between document vectors can be efficiently computed in general-purpose vector databases, in the inverted indices of text search engines, and in other applications. To complement this paper, we also contributed an implementation of the scm to Gensim¹ [12], a free open-source natural language processing library.

In our remarks for Theorem 3.4, we discuss strategies for making no column of the matrix $S$ contain more than $C$ non-zero elements. Future research will evaluate their performance on the semantic text similarity task with public datasets. Various choices of the matrix $S$ based on word embeddings, the Levenshtein distance, thesauri, and statistical regression, as well as metric matrices from previous work [8], will also be evaluated both amongst themselves and against other document similarity measures such as the lda, lsa, and wmd.

Acknowledgements. We gratefully acknowledge the support by tačr under the Omega program, project td03000295. We also sincerely thank three anonymous reviewers for their insightful comments.

REFERENCES
[1] Andrzej Białecki et al. 2012. Apache Lucene 4. In SIGIR 2012 Workshop on Open Source Information Retrieval, 17.
[2] Delphine Charlet and Geraldine Damnati. 2017. SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering. In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017). ACL, Vancouver, Canada, 315–319. doi: 10.18653/v1/S17-2051.
[3] Elizabeth Cuthill and James McKee. 1969.
Reducing the bandwidth of sparse symmetric matrices. In Proc. of the 1969 24th National Conference (ACM ’69). ACM, 157–172. doi: 10.1145/800195.805928.
[4] Pinar Heggernes et al. 2001. The Computational Complexity of the Minimum Degree Algorithm. Tech. rep. Institute for Computer Applications in Science and Engineering, Hampton, VA, (Dec. 2001). https://www.cs.purdue.edu/homes/apothen/Papers/md-conf.pdf.
[5] Roger A. Horn and Charles R. Johnson. 2013. Matrix Analysis. (Second ed.). CUP, 662. isbn: 978-0521548236.
[6] Sergio Jimenez et al. 2012. Soft Cardinality: A Parameterized Similarity Function for Text Comparison. In Proc. of the 1st Joint Conference on Lexical and Computational Semantics – Volume 1: Proc. of the Main Conference and the Shared Task, and Volume 2: Proc. of the 6th Int. Workshop on Semantic Evaluation (SemEval ’12). ACL, Montreal, Canada, 449–453. http://dl.acm.org/citation.cfm?id=2387636.2387709.
[7] Jeff Johnson et al. 2017. Billion-scale similarity search with GPUs. ArXiv e-prints, (Feb. 2017). arXiv: 1702.08734 [cs.CV].
[8] Kenta Mikawa et al. 2011. A proposal of extended cosine measure for distance metric learning in text classification. In Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on. IEEE, 1741–1746.
[9] Tomáš Mikolov et al. 2013. Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints, (Jan. 2013). arXiv: 1301.3781 [cs.CL].
[10] Preslav Nakov et al. 2017. SemEval-2017 Task 3: Community Question Answering. In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval ’17). ACL, Vancouver, Canada, (Aug. 2017), 27–48.
[11] Behnam Neyshabur and Nathan Srebro. 2015. On Symmetric and Asymmetric LSHs for Inner Product Search. In Proc. of the 32nd Int. Conference on Machine Learning (ICML ’15). Vol. 37. JMLR.org, Lille, France, 1926–1934. http://dl.acm.org/citation.cfm?id=3045118.3045323.
[12] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, (May 2010), 45–50. http://is.muni.cz/publication/884893/en.
[13] Gerard Salton and Chris Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24, 5, 513–523.
[14] Grigori Sidorov et al. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18, 3, 491–504.
[15] G. W. Stewart. 1998. Matrix Algorithms: Volume 1: Basic Decompositions. Other Titles in Applied Mathematics. SIAM, 458. isbn: 9781611971408.
[16] Mihalis Yannakakis. 1981. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, 2, 1, 77–79.

¹ See https://github.com/RaRe-Technologies/gensim/, pull requests 1827 and 2016.