Implementation Notes for the Soft Cosine Measure

Vít Novotný
Masaryk University, Faculty of Informatics, Brno, Czech Republic
witiko@mail.muni.cz

ABSTRACT
The standard bag-of-words vector space model (vsm) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning but different terminology. To overcome this limitation, Sidorov et al. [14] proposed the Soft Cosine Measure (scm), which incorporates term similarity relations. Charlet and Damnati [2] showed that the scm is highly effective in question answering (qa) systems. However, the orthonormalization algorithm proposed by Sidorov et al. [14] has an impractical time complexity of $O(n^4)$, where $n$ is the size of the vocabulary. In this paper, we prove a tighter worst-case time complexity bound of $O(n^3)$. We also present an algorithm for computing the similarity between documents and show that its worst-case time complexity is $O(1)$ under realistic conditions. Lastly, we describe the implementation in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and Elasticsearch. Our results enable the deployment of the scm in real-world information retrieval systems.

KEYWORDS
Vector Space Model, computational complexity, similarity measure

ACM Reference Format:
Vít Novotný. 2018. Implementation Notes for the Soft Cosine Measure. In The 27th ACM International Conference on Information and Knowledge Management (CIKM ’18), October 22–26, 2018, Torino, Italy. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3269206.3269317

1 INTRODUCTION
The standard bag-of-words vector space model (vsm) [13] represents documents as real vectors. Documents are expressed in a basis where each basis vector corresponds to a single term, and each coordinate corresponds to the frequency of a term in a document. Consider the documents $d_1$ = “When Antony found Julius Caesar dead” and $d_2$ = “I did enact Julius Caesar: I was killed i’ the Capitol” represented in a basis $\{\alpha_i\}_{i=1}^{14}$ of $\mathbb{R}^{14}$, where the basis vectors correspond to the terms in the order of first appearance. Then the corresponding document vectors $v_1$ and $v_2$ would have the following coordinates in $\alpha$: $(v_1)_\alpha = [1\,1\,1\,1\,1\,1\,0\,0\,0\,0\,0\,0\,0\,0]^T$ and $(v_2)_\alpha = [0\,0\,0\,1\,1\,0\,2\,1\,1\,1\,1\,1\,1\,1]^T$. Assuming $\alpha$ is orthonormal, we can take the inner product of the $\ell^2$-normalized vectors $v_1$ and $v_2$ to measure the cosine of the angle (i.e. the cosine similarity) between the documents $d_1$ and $d_2$:
$$\langle v_1/\lVert v_1\rVert,\, v_2/\lVert v_2\rVert\rangle = \frac{((v_1)_\alpha)^T (v_2)_\alpha}{\sqrt{((v_1)_\alpha)^T (v_1)_\alpha}\,\sqrt{((v_2)_\alpha)^T (v_2)_\alpha}} \approx 0.23.$$
Intuitively, this underestimates the true similarity between $d_1$ and $d_2$. Assuming $\alpha$ is orthogonal but not orthonormal, and that the terms Julius and Caesar are twice as important as the other terms, we can construct a diagonal change-of-basis matrix $W = (w_{ij})$ from $\alpha$ to an orthonormal basis $\beta$, where $w_{ii}$ corresponds to the importance of a term $i$.
This brings us closer to the true similarity: $(v_1)_\beta = W(v_1)_\alpha = [1\,1\,1\,2\,2\,1\,0\,0\,0\,0\,0\,0\,0\,0]^T$, $(v_2)_\beta = W(v_2)_\alpha = [0\,0\,0\,2\,2\,0\,2\,1\,1\,1\,1\,1\,1\,1]^T$, and
$$\langle v_1/\lVert v_1\rVert,\, v_2/\lVert v_2\rVert\rangle = \frac{(W(v_1)_\alpha)^T\, W(v_2)_\alpha}{\sqrt{(W(v_1)_\alpha)^T W(v_1)_\alpha}\,\sqrt{(W(v_2)_\alpha)^T W(v_2)_\alpha}} \approx 0.53.$$
Since we assume that the bases $\alpha$ and $\beta$ are orthogonal, the terms dead and killed contribute nothing to the cosine similarity despite the clear synonymy, because $\langle\beta_{\mathrm{dead}}, \beta_{\mathrm{killed}}\rangle = 0$. In general, the vsm will underestimate the true similarity between documents that carry the same meaning but use different terminology.

In this paper, we further develop the soft vsm described by Sidorov et al. [14], which does not assume $\alpha$ is orthogonal and which achieved state-of-the-art results on the question answering (qa) task at SemEval 2017 [2]. In Section 2, we review the previous work incorporating term similarity into the vsm. In Section 3, we restate the definition of the soft vsm and present several computational complexity results. In Section 4, we describe the implementation in vector databases and inverted indices. We conclude in Section 5 by summarizing our results and suggesting future work.

2 RELATED WORK
Most works incorporating term similarity into the vsm published prior to Sidorov et al. [14] remain in an orthogonal coordinate system and instead propose novel document similarity measures. To name a few, Mikawa et al. [8] propose the extended cosine measure, which introduces a metric matrix $Q$ as a multiplicative factor in the cosine similarity formula. $Q$ is the solution of an optimization problem that maximizes the sum of extended cosine measures between each vector and the centroid of the vector's category. Conveniently, the metric matrix $Q$ can be used directly with the soft vsm, where it defines the inner product between basis vectors. Jimenez et al. [6] equip the multiset vsm with a soft cardinality operator that corresponds to the ordinary cardinality but takes term similarities into account.

The notion of generalizing the vsm to non-orthogonal coordinate systems was perhaps first explored by Sidorov et al. [14] in the context of entrance exam question answering, where the basis vectors did not correspond directly to terms, but to n-grams constructed by following paths in syntactic trees. The authors derive the inner product of two basis vectors from the edit distance between the corresponding n-grams. The soft cosine measure (scm) is their term for the formula that computes the cosine similarity between two vectors expressed in a non-orthogonal basis. They also present an algorithm that computes a change-of-basis matrix to an orthonormal basis in time $O(n^4)$. We present an $O(n^3)$ algorithm in this paper.

Charlet and Damnati [2] achieved state-of-the-art results at the qa task at SemEval 2017 [10] by training a document classifier on soft cosine measures between document passages. Unlike Sidorov et al. [14], Charlet and Damnati [2] already use basis vectors that correspond to terms rather than to n-grams. They derive the inner product of two basis vectors both from the edit distance between the corresponding terms and from the inner product of the corresponding word2vec term embeddings [9].

3 COMPUTATIONAL COMPLEXITY
In this section, we restate the definition of the soft vsm as it was described by Sidorov et al. [14]. We then prove a tighter worst-case time complexity bound for computing a change-of-basis matrix to an orthonormal basis. We also prove that under certain assumptions, the inner product is a linear-time operation.
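Before turning to the formal definition, we note that the two cosine similarities from the introductory example are easy to verify numerically. The following NumPy sketch is ours (it is not part of the experiments in this paper) and uses the term-frequency vectors and weights from Section 1:

```python
import numpy as np

# Term-frequency vectors of d1 and d2 in the basis alpha (14 terms,
# ordered by first appearance), as in Section 1.
v1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
v2 = np.array([0, 0, 0, 1, 1, 0, 2, 1, 1, 1, 1, 1, 1, 1], dtype=float)

def cosine(u, v):
    """Cosine similarity of two vectors expressed in an orthonormal basis."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(v1, v2), 2))          # 0.23, the plain cosine similarity

# The main diagonal of W: the terms Julius and Caesar (coordinates 4 and 5)
# are twice as important as the remaining terms.
w = np.ones(14)
w[[3, 4]] = 2.0
print(round(cosine(w * v1, w * v2), 2))  # 0.53, the weighted cosine similarity
```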
Definition 3.1. Let $\mathbb{R}^n$ be the real $n$-space over $\mathbb{R}$ equipped with the bilinear inner product $\langle\cdot,\cdot\rangle$. Let $\{\alpha_i\}_{i=1}^{n}$ be the basis of $\mathbb{R}^n$ in which we express our vectors. Let $W_\alpha = (w_{ij})$ be a diagonal change-of-basis matrix from $\alpha$ to a normalized basis $\{\beta_i\}_{i=1}^{n}$ of $\mathbb{R}^n$, i.e. $\langle\beta_i, \beta_j\rangle \in [-1, 1]$ and $\langle\beta_i, \beta_i\rangle = 1$. Let $S_\beta = (s_{ij})$ be the metric matrix of $\mathbb{R}^n$ w.r.t. $\beta$, i.e. $s_{ij} = \langle\beta_i, \beta_j\rangle$. Then $(\mathbb{R}^n, W_\alpha, S_\beta)$ is a soft vsm.

Theorem 3.2. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Then a change-of-basis matrix $E$ from the basis $\beta$ to an orthonormal basis of $\mathbb{R}^n$ can be computed in time $O(n^3)$.

Proof. By definition, $S = EE^T$ for any change-of-basis matrix $E$ from the basis $\beta$ to an orthonormal basis. Since $S$ contains the inner products of the linearly independent vectors $\beta$, it is Gramian and positive definite [5, p. 441]. The Gramianness of $S$ also implies its symmetry. Therefore, a lower triangular $E$ is uniquely determined by the Cholesky factorization of the symmetric positive-definite $S$, which we can compute in time $O(n^3)$ [15, p. 191]. □

Remark. See Table 1 for an experimental comparison. Although the vocabulary in our introductory example contains only $n = 14$ terms, $n$ is in the millions for real-world corpora such as the English Wikipedia. Therefore, we generally need to store the $n \times n$ matrix $S$ in a sparse format, so that it fits into main memory. Later, we will discuss how the density of $S$ can be reduced, but the Cholesky factor $E$ can also be arbitrarily dense and therefore expensive to store. Given a permutation matrix $P$, we can instead factorize $P^T S P$ into $FF^T$. Finding the permutation matrix $P$ that minimizes the density of the Cholesky factor $F$ is NP-hard [16], but heuristic strategies are known [3, 4]. Using the fact that $P^T = P^{-1}$, and basic facts about the transpose, we can derive $E = PF$ as follows: $S = PP^T S PP^T = PFF^T P^T = PF(PF)^T = EE^T$.

Table 1: The real time to compute a matrix $E$ from a dense matrix $S$, averaged over 100 iterations. We used two Intel Xeon E5-2650 v2 (20M cache, 2.60 GHz) processors to evaluate the $O(n^3)$ Cholesky factorization from NumPy 1.14.3 and the $O(n^4)$ iterated Gaussian elimination from LAPACK. For $n > 1000$, only sparse $S$ seem practical.

  n terms   Algorithm                Real computation time
  100       Cholesky factorization     0.0006 sec   (0.606 ms)
  100       Gaussian elimination       0.0529 sec  (52.893 ms)
  500       Cholesky factorization     0.0086 sec   (8.640 ms)
  500       Gaussian elimination      22.7361 sec  (22.736 sec)
  1000      Cholesky factorization     0.0304 sec  (30.378 ms)
  1000      Gaussian elimination     354.2746 sec   (5.905 min)

Lemma 3.3. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, y \in \mathbb{R}^n$. Then $\langle x, y\rangle = (W(x)_\alpha)^T S\, W(y)_\alpha$.

Proof. Let $E$ be the change-of-basis matrix from the basis $\beta$ to an orthonormal basis $\gamma$ of $\mathbb{R}^n$. Then:
$$\langle x, y\rangle = ((x)_\gamma)^T (y)_\gamma = (E(x)_\beta)^T E(y)_\beta = (EW(x)_\alpha)^T EW(y)_\alpha = \Bigl(\sum_{i=1}^{n} (\beta_i)_\gamma \cdot w_{ii} \cdot (x_i)_\alpha\Bigr)^{\!T} \sum_{j=1}^{n} (\beta_j)_\gamma \cdot w_{jj} \cdot (y_j)_\alpha$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ii} \cdot (x_i)_\alpha \cdot \langle\beta_i, \beta_j\rangle \cdot w_{jj} \cdot (y_j)_\alpha = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ii} \cdot (x_i)_\alpha \cdot s_{ij} \cdot w_{jj} \cdot (y_j)_\alpha = (W(x)_\alpha)^T S\, W(y)_\alpha. \;\square$$

Remark. From here, we can directly derive the cosine of the angle between $x$ and $y$ (i.e. what Sidorov et al. [14] call the scm) as follows:
$$\langle x/\lVert x\rVert,\, y/\lVert y\rVert\rangle = \frac{(W(x)_\alpha)^T S\, W(y)_\alpha}{\sqrt{(W(x)_\alpha)^T S\, W(x)_\alpha}\,\sqrt{(W(y)_\alpha)^T S\, W(y)_\alpha}}.$$
The scm is actually the starting point for Charlet and Damnati [2], who propose matrices $S$ that are not necessarily metric. If, like them, we are only interested in computing the scm, then we only require that the square roots remain real, i.e. that $x \neq 0 \implies (W(x)_\alpha)^T S\, W(x)_\alpha \geq 0$. For arbitrary $x \in \mathbb{R}^n$, this holds iff $S$ is positive semi-definite. However, since the coordinates $(x)_\alpha$ correspond to non-negative term frequencies, it is sufficient that $W$ and $S$ are non-negative as well. If we are only interested in computing the inner product, then $S$ can be arbitrary.
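As a minimal dense sketch of these operations (the function names are ours; a production implementation would use sparse matrices, as discussed in the remark for Theorem 3.2), the change-of-basis matrix $E$ of Theorem 3.2 and the scm formula above can be computed in NumPy as follows:

```python
import numpy as np

def change_of_basis(S):
    """A lower-triangular E with S = E @ E.T, as in Theorem 3.2.

    S must be symmetric positive definite; np.linalg.cholesky runs in O(n^3).
    """
    return np.linalg.cholesky(S)

def soft_cosine_measure(x_alpha, y_alpha, w, S):
    """The scm of two term-frequency vectors via the formula above.

    `w` holds the main diagonal of W; `S` is the term-similarity matrix.
    """
    wx, wy = w * x_alpha, w * y_alpha
    return (wx @ S @ wy) / np.sqrt((wx @ S @ wx) * (wy @ S @ wy))
```

With $S$ equal to the identity matrix, `soft_cosine_measure` reduces to the weighted cosine similarity from the introduction.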
Theorem 3.4. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm such that no column of $S$ contains more than $C$ non-zero elements, where $C$ is a constant. Let $x, y \in \mathbb{R}^n$ and let $m$ be the number of non-zero elements in $(x)_\beta$. Then $\langle x, y\rangle$ can be computed in time $O(m)$.

Proof. Assume that $(x)_\alpha$, $(y)_\alpha$, and $S$ are represented by data structures with constant-time column access and non-zero element traversal, e.g. compressed sparse column (csc) matrices. Further assume that $W$ is represented by an array containing the main diagonal of $W$. Then Algorithm 1 computes $(W(x)_\alpha)^T S\, W(y)_\alpha$ in time $O(m)$, which by Lemma 3.3 corresponds to $\langle x, y\rangle$. □

Algorithm 1 The inner product of x and y
1: r ← 0
2: for each i such that (xi)α is non-zero do        ▷ = m iterations
3:     for each j such that sij is non-zero do      ▷ ≤ C iterations
4:         r ← r + wii · (xi)α · sij · wjj · (yj)α
5: return r

Remark. Similarly, we can show that if a column of $S$ contains $C$ non-zero elements on average, then $\langle x, y\rangle$ has an average-case time complexity of $O(m)$. Note also that most information retrieval systems impose a limit on the length of a query document. Therefore, $m$ is usually bounded by a constant and $O(m) = O(1)$. Since we are usually interested in the inner products of all document pairs in two corpora (e.g. one containing queries and the other the actual documents), we can achieve significant speed improvements with vector processors by computing $(WX)^T S\, WY$, where $X$ and $Y$ are corpus matrices containing the coordinates of document vectors in the basis $\alpha$ as columns. To compute the scm, we first need to normalize the document vectors by performing an entrywise division of every column in $X$ by $\sqrt{\operatorname{diag}\bigl((WX)^T S\, WX\bigr)}$; the diagonal equals the vector of row-wise sums of $(WX)^T S \circ (WX)^T$, where $\circ$ denotes the entrywise product, so the full matrix product never needs to be materialized. $Y$ is normalized analogously.

There are several strategies for making no column of $S$ contain more than $C$ non-zero elements. If we do not require that $S$ is metric (e.g. because we only wish to compute the inner product, or the scm), a simple strategy is to start with an empty matrix and to insert the $C - 1$ largest elements and the diagonal element from every column of $S$. However, the resulting matrix will likely be asymmetric, which makes the inner product formula asymmetric as well. We can regain symmetry by always inserting an element $s_{ij}$ together with the element $s_{ji}$, and only if this does not make the column $j$ contain more than $C$ non-zero elements. This strategy is greedy, since later columns contain non-zero elements inserted by earlier columns. Our preliminary experiments suggest that processing columns that correspond to increasingly frequent terms performs best on the task of Charlet and Damnati [2]. Finally, by limiting the sum of all non-diagonal elements in a column to be less than one, we can make $S$ strictly diagonally dominant and therefore positive definite, which enables us to compute $E$ through the Cholesky factorization.
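The greedy symmetric strategy can be sketched as follows. This is our illustration rather than code from the original implementation: the function name and the use of scipy.sparse are our choices, and `column_order` would list the columns by increasing term frequency, as suggested above.

```python
import numpy as np
from scipy import sparse

def sparsify_term_similarities(S, C, column_order=None):
    """Greedily keep at most C non-zero elements per column of a dense,
    symmetric, non-negative term-similarity matrix S (a NumPy array).

    The diagonal is always kept; off-diagonal elements s_ij are inserted
    largest-first, together with s_ji, and only if neither column i nor
    column j would exceed C non-zero elements.
    """
    n = S.shape[0]
    result = sparse.lil_matrix((n, n))
    result.setdiag(S.diagonal())
    nonzeros = np.ones(n, dtype=int)  # per-column counts; diagonal included
    columns = column_order if column_order is not None else range(n)
    for i in columns:
        for j in np.argsort(S[:, i])[::-1]:  # candidates, largest first
            if nonzeros[i] >= C:
                break                        # column i is full
            if S[j, i] <= 0:
                break                        # no positive candidates remain
            if j == i or result[j, i] != 0:
                continue                     # diagonal or already inserted
            if nonzeros[j] >= C:
                continue                     # column j is full
            result[i, j] = result[j, i] = S[i, j]
            nonzeros[i] += 1
            nonzeros[j] += 1
    return result.tocsc()
```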
4 IMPLEMENTATION IN VECTOR DATABASES AND INVERTED INDICES
In this section, we present coordinate transformations for retrieving the nearest document vectors according to the inner product and the soft cosine measure from general-purpose vector databases such as Annoy or Faiss [7]. We also discuss the implementation in the inverted indices of text search engines such as Apache Lucene [1].

Remark. With a vector database, we can transform document vectors to an orthonormal basis $\gamma$. In the transformed coordinates, the dot product $((x)_\gamma)^T (y)_\gamma$ corresponds to the inner product $\langle x, y\rangle$, and the cosine similarity corresponds to the cosine of the angle $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ (i.e. the soft cosine measure). A vector database that supports nearest-neighbor search according to either the dot product or the cosine similarity will therefore retrieve vectors expressed in $\gamma$ according to either the inner product or the soft cosine measure. We can compute a change-of-basis matrix $E$ of order $n$ in time $O(n^3)$ by Theorem 3.2 and use it to transform every vector $x \in \mathbb{R}^n$ to $\gamma$ by computing $EW(x)_\alpha$. However, this approach requires that $S$ is symmetric positive-definite and that we recompute $E$ and reindex the vector database each time $S$ changes. We will now discuss transformations that do not require $E$ and for which a non-negative $S$ is sufficient, as discussed in the remark for Lemma 3.3.

Theorem 4.1. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, x', y \in \mathbb{R}^n$ such that $(x')_\beta = S^T (x)_\beta$. Then $\langle x, y\rangle = ((x')_\beta)^T (y)_\beta$.

Proof. $((x')_\beta)^T (y)_\beta = ((x)_\beta)^T S\, (y)_\beta = \langle x, y\rangle$ by Lemma 3.3. □

Remark. By transforming a query vector $x$ into $(x')_\beta$, we can retrieve documents according to the inner product in vector databases that only support nearest-neighbor search according to the dot product. Note that we do not introduce $S$ into $(y)_\beta$, which allows us to change $S$ without changing the documents in a vector database, and that $S$ can be arbitrary as discussed in the remark for Lemma 3.3.

Theorem 4.2. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm. Let $x, x', y, y', z, z' \in \mathbb{R}^n$ s.t. $x, y, z \neq 0$, $(x')_\beta = S^T (x)_\beta$, $(y')_\beta = \frac{(y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}}$, and $(z')_\beta = \frac{(z)_\beta}{\sqrt{((z)_\beta)^T S\, (z)_\beta}}$. Then $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle \leq \langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$ iff $((x')_\beta)^T (y')_\beta \leq ((x')_\beta)^T (z')_\beta$.

Proof. $((x')_\beta)^T (y')_\beta = \frac{((x)_\beta)^T S\, (y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}}$. From Lemma 3.3, this equals $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ except for the missing term $\sqrt{((x)_\beta)^T S\, (x)_\beta}$ in the divisor. The term is constant in both $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ and $\langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$, so the ordering is preserved. □

Remark. By transforming a query vector $x$ into $(x')_\beta$ and document vectors $y$ into $(y')_\beta$, we can retrieve documents according to the scm in vector databases that only support nearest-neighbor search according to the dot product.

Theorem 4.3. Let $G = (\mathbb{R}^n, W_\alpha, S_\beta)$ be a soft vsm s.t. $S_\beta$ is non-negative. Let $x, y, y', z, z' \in \mathbb{R}^n$ and $x', y'', z'' \in \mathbb{R}^{n+1}$ s.t. $x \neq 0$, $y, z > 0$,
$$(x')_{\beta'} = \Bigl[\tfrac{(S^T(x)_\beta)^T}{\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}}\;\; 0\Bigr]^T,\quad (y')_\beta = \tfrac{(y)_\beta}{\sqrt{((y)_\beta)^T S\, (y)_\beta}},\quad (y'')_{\beta'} = \Bigl[((y')_\beta)^T \;\; \sqrt{1 - ((y')_\beta)^T (y')_\beta}\Bigr]^T,$$
$$(z')_\beta = \tfrac{(z)_\beta}{\sqrt{((z)_\beta)^T S\, (z)_\beta}},\quad\text{and}\quad (z'')_{\beta'} = \Bigl[((z')_\beta)^T \;\; \sqrt{1 - ((z')_\beta)^T (z')_\beta}\Bigr]^T,$$
where $\beta' = \beta \cup \{[0 \ldots 0\; 1]^T \in \mathbb{R}^{n+1}\}$. Then $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle \leq \langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$ iff
$$\frac{((x')_{\beta'})^T (y'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((y'')_{\beta'})^T (y'')_{\beta'}}} \leq \frac{((x')_{\beta'})^T (z'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((z'')_{\beta'})^T (z'')_{\beta'}}}.$$

Proof. $((x')_{\beta'})^T (x')_{\beta'} = 1$. Since $S$ is non-negative and $(y)_\beta > 0$, $((y)_\beta)^T S\, (y)_\beta \geq ((y)_\beta)^T (y)_\beta$ and therefore $((y')_\beta)^T (y')_\beta \leq 1$ and $((y'')_{\beta'})^T (y'')_{\beta'} = 1$ [11, sec. 4.2]. Therefore:
$$\frac{((x')_{\beta'})^T (y'')_{\beta'}}{\sqrt{((x')_{\beta'})^T (x')_{\beta'}}\,\sqrt{((y'')_{\beta'})^T (y'')_{\beta'}}} = ((x')_{\beta'})^T (y'')_{\beta'} = \frac{((x)_\beta)^T S\, (y)_\beta}{\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}\,\sqrt{((y)_\beta)^T S\, (y)_\beta}}.$$
From Lemma 3.3, this equals $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ except for the missing term $\sqrt{((x)_\beta)^T S\, (x)_\beta}$ and the extra term $\sqrt{(S^T(x)_\beta)^T\, S^T(x)_\beta}$ in the divisor. The terms are constant in both $\langle x/\lVert x\rVert, y/\lVert y\rVert\rangle$ and $\langle x/\lVert x\rVert, z/\lVert z\rVert\rangle$, so the ordering is preserved. □

Remark. By transforming a query vector $x$ into $(x')_{\beta'}$ and document vectors $y$ into $(y'')_{\beta'}$, we can retrieve documents according to the scm in vector databases that only support nearest-neighbor search according to the cosine similarity.
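In NumPy, the transformations of Theorems 4.1–4.3 amount to a few lines. The following is a sketch under our naming: `x_beta` and `y_beta` stand for the coordinates $(x)_\beta = W(x)_\alpha$ and $(y)_\beta = W(y)_\alpha$, and $S$ is assumed non-negative where Theorem 4.3 requires it.

```python
import numpy as np

def expand_query(x_beta, S):
    """Theorem 4.1: (x')_beta = S^T (x)_beta, so that a plain dot product
    with an untransformed document vector equals the inner product <x, y>."""
    return S.T @ x_beta

def normalize_document(y_beta, S):
    """Theorem 4.2: (y')_beta = (y)_beta / sqrt((y)_beta^T S (y)_beta); dot
    products of expanded queries with these vectors rank documents by the scm."""
    return y_beta / np.sqrt(y_beta @ S @ y_beta)

def query_for_cosine_database(x_beta, S):
    """Theorem 4.3: a unit-length query with an appended zero coordinate, for
    databases that only support nearest-neighbor search by cosine similarity."""
    x_prime = S.T @ x_beta
    return np.append(x_prime / np.linalg.norm(x_prime), 0.0)

def document_for_cosine_database(y_beta, S):
    """Theorem 4.3: the appended coordinate pads the normalized document
    vector to unit length without affecting dot products with the queries."""
    y_prime = normalize_document(y_beta, S)
    residual = max(0.0, 1.0 - float(y_prime @ y_prime))  # guard against rounding
    return np.append(y_prime, np.sqrt(residual))
```

The same expansion $S^T(x)_\beta$ also drives the query expansion for inverted indices discussed below.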
Whereas most vector databases are designed for storing low-dimensional and dense vector coordinates, document vectors have the dimension $n$, which can be in the millions for real-world corpora such as the English Wikipedia. Apart from that, a document contains only a small fraction of the terms in the vocabulary, which makes the coordinates extremely sparse. Therefore, the coordinates need to be converted to a dense low-dimensional representation, e.g. using latent semantic analysis (lsa), before they are stored in a vector database or used for queries.

Unlike vector databases, inverted-index-based search engines are built around a data structure called the inverted index, which maps each term in our vocabulary to a list of documents (a posting) containing the term. Documents in a posting are sorted by a common criterion. The search engine tokenizes a text query into terms, retrieves the postings for the query terms, and then traverses the postings, computing the similarity between the query and the documents. We can directly replace the search engine's document similarity formula with the formula for the inner product from Lemma 3.3, or with the formula for the scm. After this straightforward change, the system will still only retrieve documents that have at least one term in common with the query. Therefore, we first need to expand the query vector $x$ by computing $((x)_\beta)^T S$ and retrieving the postings for all terms that correspond to the non-zero coordinates in the expanded vector. The expected number of these terms is $O(mC)$, where $m$ is the number of non-zero elements in $(x)_\alpha$ and $C$ is the maximum number of non-zero elements in any column of $S$. Assuming $m$ and $C$ are bounded by a constant, $O(mC) = O(1)$.

5 CONCLUSION AND FUTURE WORK
In this paper, we examined the soft vector space model (vsm) of Sidorov et al. [14]. We restated its definition, we proved a tighter time complexity bound of $O(n^3)$ for a related orthonormalization problem, and we showed how the inner product and the soft cosine measure between document vectors can be efficiently computed in general-purpose vector databases, in the inverted indices of text search engines, and in other applications. To complement this paper, we also contributed an implementation of the scm to Gensim¹ [12], a free open-source natural language processing library.

In our remarks for Theorem 3.4, we discuss strategies for making no column of the matrix $S$ contain more than $C$ non-zero elements. Future research will evaluate their performance on the semantic text similarity task with public datasets. Various choices of the matrix $S$ based on word embeddings, the Levenshtein distance, thesauri, and statistical regression, as well as metric matrices from previous work [8], will also be evaluated both amongst themselves and against other document similarity measures such as the lda, lsa, and wmd.

Acknowledgements. We gratefully acknowledge the support by tačr under the Omega program, project td03000295. We also sincerely thank three anonymous reviewers for their insightful comments.

REFERENCES
[1] Andrzej Białecki et al. 2012. Apache Lucene 4. In SIGIR 2012 Workshop on Open Source Information Retrieval, 17.
[2] Delphine Charlet and Geraldine Damnati. 2017. SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering. In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017). ACL, Vancouver, Canada, 315–319. doi: 10.18653/v1/S17-2051.
[3] Elizabeth Cuthill and James McKee. 1969.
Reducing the bandwidth of sparse symmetric matrices. In Proc. of the 1969 24th National Conference (ACM ’69). ACM, 157–172. doi: 10.1145/800195.805928.
[4] Pinar Heggernes et al. 2001. The Computational Complexity of the Minimum Degree Algorithm. Tech. rep. Institute for Computer Applications in Science and Engineering, Hampton, VA, (Dec. 2001). https://www.cs.purdue.edu/homes/apothen/Papers/md-conf.pdf.
[5] Roger A. Horn and Charles R. Johnson. 2013. Matrix Analysis. (Second ed.). CUP, 662. isbn: 978-0521548236.
[6] Sergio Jimenez et al. 2012. Soft Cardinality: A Parameterized Similarity Function for Text Comparison. In Proc. of the 1st Joint Conference on Lexical and Computational Semantics – Volume 1: Proc. of the Main Conference and the Shared Task, and Volume 2: Proc. of the 6th Int. Workshop on Semantic Evaluation (SemEval ’12). ACL, Montreal, Canada, 449–453. http://dl.acm.org/citation.cfm?id=2387636.2387709.
[7] Jeff Johnson et al. 2017. Billion-scale similarity search with GPUs. ArXiv e-prints, (Feb. 2017). arXiv: 1702.08734 [cs.CV].
[8] Kenta Mikawa et al. 2011. A proposal of extended cosine measure for distance metric learning in text classification. In Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on. IEEE, 1741–1746.
[9] Tomáš Mikolov et al. 2013. Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints, (Jan. 2013). arXiv: 1301.3781 [cs.CL].
[10] Preslav Nakov et al. 2017. SemEval-2017 Task 3: Community Question Answering. In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval ’17). ACL, Vancouver, Canada, (Aug. 2017), 27–48.
[11] Behnam Neyshabur and Nathan Srebro. 2015. On Symmetric and Asymmetric LSHs for Inner Product Search. In Proc. of the 32nd Int. Conference on Machine Learning (ICML ’15). Vol. 37. JMLR.org, Lille, France, 1926–1934. http://dl.acm.org/citation.cfm?id=3045118.3045323.
[12] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, (May 2010), 45–50. http://is.muni.cz/publication/884893/en.
[13] Gerard Salton and Chris Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24, 5, 513–523.
[14] Grigori Sidorov et al. 2014. Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, 18, 3, 491–504.
[15] G. W. Stewart. 1998. Matrix Algorithms: Volume 1: Basic Decompositions. Other Titles in Applied Mathematics. SIAM, 458. isbn: 9781611971408.
[16] Mihalis Yannakakis. 1981. Computing the minimum fill-in is NP-complete. SIAM Journal on Algebraic Discrete Methods, 2, 1, 77–79.

¹ See https://github.com/RaRe-Technologies/gensim/, pull requests 1827 and 2016.