NOVOTNÝ, Vít. Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM. 2019.
Basic information
Original name Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition 2019.
Other information
Original language English
Type of outcome Presentations at conferences
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
WWW Scientific poster
RIV identification code RIV/00216224:14330/19:00109518
Organization unit Faculty of Informatics
Keywords (in Czech) umělá inteligence; strojové učení; výpočetní lingvistika; získávání znalostí; zodpovídání otázek; slovní embeddingy; word2vec; word2bits; lineární algebra; výpočetní složitost
Keywords in English artificial intelligence; machine learning; computational linguistics; information retrieval; question answering; word embeddings; transfer learning; word2vec; word2bits; linear algebra; computational complexity
Tags machine learning
Tags International impact
Changed by: RNDr. Vít Novotný, učo 409729. Changed: 25. 11. 2019 04:15.

This work is a scientific poster presented at the ML Prague 2019 conference, held February 22–24, 2019.

The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents that share the same meaning but use different terminology. To overcome this limitation, Sidorov et al. (2014) proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati (2017) showed that the SCM using word embedding similarity is highly effective in question answering systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary.
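To make the measure concrete, here is a minimal sketch of the SCM in its usual bilinear form, SCM(x, y) = xᵀSy / (√(xᵀSx) · √(yᵀSy)), where S is a term similarity matrix. The three-term vocabulary, the similarity value 0.5, and the function name `soft_cosine` are illustrative assumptions, not taken from the poster. With S set to the identity matrix, the measure reduces to the ordinary cosine similarity.

```python
import numpy as np


def soft_cosine(x, y, S):
    """Soft cosine measure of bag-of-words vectors x and y.

    S is a symmetric term similarity matrix; S[i, j] is the similarity
    of term i and term j, with ones on the diagonal.
    """
    xy = x @ S @ y
    xx = x @ S @ x
    yy = y @ S @ y
    if xx <= 0 or yy <= 0:
        # An all-zero document has no direction; treat as dissimilar.
        return 0.0
    return float(xy / np.sqrt(xx * yy))


# Hypothetical vocabulary: ["play", "game", "weather"].
# The value 0.5 between "play" and "game" is made up for illustration.
S = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
doc_a = np.array([1.0, 0.0, 0.0])  # contains only "play"
doc_b = np.array([0.0, 1.0, 0.0])  # contains only "game"

hard = soft_cosine(doc_a, doc_b, np.eye(3))  # identity S: ordinary cosine
soft = soft_cosine(doc_a, doc_b, S)          # uses term similarity
```

The two documents share no terms, so the ordinary cosine treats them as unrelated, while the soft variant credits the similarity between "play" and "game".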

In our work, we prove a tighter worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents and show that its worst-case time complexity is O(1) under realistic conditions. Lastly, we describe implementations in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
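The abstract does not spell out the conditions behind the O(1) bound. The sketch below shows one plausible reading, labeled here as an assumption: if each document holds at most a constant number of distinct terms and each term keeps at most a constant number of similar neighbors, then the bilinear form xᵀSy touches only a constant number of entries regardless of the vocabulary size. The dictionary layout and the names `sparse_bilinear` and `soft_cosine_sparse` are hypothetical and are not the poster's algorithm.

```python
from math import sqrt


def sparse_bilinear(x, y, sim):
    """Compute x^T S y for sparse documents x, y (term -> weight).

    sim maps a term to its similar terms (term -> {neighbor: similarity}),
    excluding the term itself; self-similarity is taken to be 1.
    Cost is O(len(x) * max neighbors), independent of vocabulary size.
    """
    total = 0.0
    for term_i, w_i in x.items():
        # Diagonal of S: every term is fully similar to itself.
        total += w_i * y.get(term_i, 0.0)
        # Off-diagonal entries: only the stored neighbors are nonzero.
        for term_j, s_ij in sim.get(term_i, {}).items():
            total += w_i * s_ij * y.get(term_j, 0.0)
    return total


def soft_cosine_sparse(x, y, sim):
    """Soft cosine measure built on the sparse bilinear form."""
    xx = sparse_bilinear(x, x, sim)
    yy = sparse_bilinear(y, y, sim)
    if xx <= 0 or yy <= 0:
        return 0.0
    return sparse_bilinear(x, y, sim) / sqrt(xx * yy)


# Hypothetical symmetric neighbor lists; the value 0.5 is made up.
sim = {"play": {"game": 0.5}, "game": {"play": 0.5}}
doc_a = {"play": 1.0, "weather": 1.0}
doc_b = {"game": 1.0, "weather": 1.0}

score = soft_cosine_sparse(doc_a, doc_b, sim)
```

Storing only the nonzero entries of S per term resembles the posting-list layout of an inverted index, which is one reason a sparse formulation of this kind could fit the text search engines named above.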

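MUNI/A/1145/2018, internal MU code
Name: Applied research at FI: software architectures of critical infrastructures, computer systems security, techniques for processing and visualization of big data, and augmented reality (Aplikovaný výzkum na FI: softwarové architektury kritických infrastruktur, bezpečnost počítačových systémů, techniky pro zpracování a vizualizaci velkých dat a rozšířená realita)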
Investor: Masaryk University, Grant Agency of Masaryk University, Category A
Attached file: mlprague-2019-scm.pdf, uploaded by Novotný, V., 24. 4. 2019

