NOVOTNÝ, Vít. Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM. 2019.
Basic information
Original name Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition 2019.
Other information
Original language English
Type of outcome Presentations at conferences
Field of Study 10200 1.2 Computer and information sciences
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
WWW Scientific poster
RIV identification code RIV/00216224:14330/19:00109518
Organization unit Faculty of Informatics
Keywords (in Czech) umělá inteligence; strojové učení; výpočetní lingvistika; získávání znalostí; zodpovídání otázek; slovní embeddingy; word2vec; word2bits; lineární algebra; výpočetní složitost
Keywords in English artificial intelligence; machine learning; computational linguistics; information retrieval; question answering; word embeddings; transfer learning; word2vec; word2bits; linear algebra; computational complexity
Tags machine learning
Tags International impact
Changed by RNDr. Vít Novotný, Ph.D., učo 409729. Changed: 1. 11. 2021 09:35.

This work is a scientific poster presented at the ML Prague 2019 conference, held February 22–24, 2019.

The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents that have the same meaning but use different terminology. To overcome this limitation, Sidorov et al. (2014) proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati (2017) showed that the SCM using word embedding similarity is highly effective in question answering systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary.
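The difference between the plain cosine and the SCM can be illustrated with a minimal sketch. It assumes the usual formulation of the soft cosine, x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)), where S is a term similarity matrix. The three-term vocabulary, the similarity value 0.8, and the function name soft_cosine are hypothetical and chosen only for illustration; they do not come from the poster.

```python
import math


def soft_cosine(x, y, s):
    """Soft cosine of dense term vectors x and y under term similarity matrix s."""

    def bilinear(a, b):
        # Computes a^T S b by summing over all pairs of terms.
        return sum(
            a[i] * s[i][j] * b[j]
            for i in range(len(a))
            for j in range(len(b))
        )

    denom = math.sqrt(bilinear(x, x)) * math.sqrt(bilinear(y, y))
    return bilinear(x, y) / denom if denom else 0.0


# Hypothetical vocabulary: ["car", "automobile", "banana"]
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Assumed term similarities: "car" and "automobile" are related (0.8).
similar = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

# With the identity matrix, the SCM reduces to the ordinary cosine.
plain_score = soft_cosine(doc_a, doc_b, identity)
# With term similarities, the shared meaning is no longer ignored.
soft_score = soft_cosine(doc_a, doc_b, similar)
```

With the identity matrix, the two documents share no terms and score zero; with the assumed similarity matrix, the score reflects the relatedness of the two terms. The dense double loop here is deliberately naive and does not reflect the efficiency results described in the abstract.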

In our work, we prove a tighter, lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents, and we show that its worst-case time complexity is O(1) under realistic conditions. Lastly, we describe implementations in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
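One way to see why the cost of a document comparison need not grow with the vocabulary size is a sparse sketch. It assumes that each document contains a bounded number of distinct terms and that each term is similar to a bounded number of other terms; this is one possible reading of "realistic conditions", not necessarily the exact conditions or algorithm of the poster. The data structures, the function names sparse_bilinear and sparse_soft_cosine, and the example values are hypothetical.

```python
import math


def sparse_bilinear(x, y, s):
    """Compute x^T S y for sparse inputs.

    x and y map term -> weight; s maps term -> {similar term: similarity}.
    The work done depends on the number of non-zero entries visited,
    not on the size of the vocabulary.
    """
    total = 0.0
    for term_x, weight_x in x.items():
        for term_y, similarity in s.get(term_x, {}).items():
            weight_y = y.get(term_y)
            if weight_y is not None:
                total += weight_x * similarity * weight_y
    return total


def sparse_soft_cosine(x, y, s):
    """Soft cosine of two sparse documents under sparse term similarities s."""
    denom = math.sqrt(sparse_bilinear(x, x, s)) * math.sqrt(sparse_bilinear(y, y, s))
    return sparse_bilinear(x, y, s) / denom if denom else 0.0


# Assumed sparse term similarity relation (diagonal entries included explicitly).
term_similarity = {
    "car": {"car": 1.0, "automobile": 0.8},
    "automobile": {"automobile": 1.0, "car": 0.8},
    "banana": {"banana": 1.0},
}

query = {"car": 1.0}
document = {"automobile": 2.0, "banana": 1.0}

score = sparse_soft_cosine(query, document, term_similarity)
```

If a document has at most m non-zero terms and each term has at most c similar terms, sparse_bilinear visits at most m * c pairs, which is constant with respect to the vocabulary size when m and c are bounded.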

MUNI/A/1145/2018, internal MU code
Name: Applied research at FI: software architectures of critical infrastructures, computer systems security, techniques for processing and visualization of big data, and augmented reality.
Investor: Masaryk University, Critical Infrastructure Software Architectures, Computer Systems Security, Data Processing and Visualization Techniques, and Augmented Reality, Category A