Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM

NOVOTNÝ, Vít. Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM. 2019.


Basic information
Original name	Soft Cosine Measure: Capturing Term Similarity in the Bag of Words VSM
Authors	NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	2019.

Other information
Original language	English
Type of outcome	Presentations at conferences
Field of Study	10200 1.2 Computer and information sciences
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
WWW	Scientific poster
RIV identification code	RIV/00216224:14330/19:00109518
Organization unit	Faculty of Informatics
Keywords (in Czech)	umělá inteligence; strojové učení; výpočetní lingvistika; získávání znalostí; zodpovídání otázek; slovní embeddingy; word2vec; word2bits; lineární algebra; výpočetní složitost
Keywords in English	artificial intelligence; machine learning; computational linguistics; information retrieval; question answering; word embeddings; transfer learning; word2vec; word2bits; linear algebra; computational complexity
Tags	machine learning
Tags	International impact
Changed by	RNDr. Vít Starý Novotný, Ph.D., učo 409729. Changed: 1/11/2021 09:35.

Abstract

This scientific poster was presented at the ML Prague 2019 conference, held February 22–24, 2019.

The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents that have the same meaning but use different terminology. To overcome this limitation, Sidorov et al. (2014) proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati (2017) showed that the SCM using word embedding similarity is highly effective in question answering systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary.
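
To make the idea concrete, the SCM of two bag-of-words vectors x and y is commonly written as (x^T S y) / (sqrt(x^T S x) * sqrt(y^T S y)), where S is a term similarity matrix. The following is a minimal sketch of that formula only, not the algorithm presented on the poster. The function names, the dictionary-based sparse representation, and the example similarity value of 0.8 are assumptions made for illustration.

```python
from math import sqrt


def term_similarity(S, a, b):
    """Look up the similarity of terms a and b in a sparse, symmetric matrix S.

    S is a dict keyed by term pairs (a hypothetical representation).
    Identical terms have similarity 1; unlisted pairs have similarity 0.
    """
    if a == b:
        return 1.0
    return S.get((a, b), S.get((b, a), 0.0))


def bilinear(x, S, y):
    """Compute x^T S y for sparse bag-of-words vectors given as dicts."""
    return sum(
        wx * wy * term_similarity(S, tx, ty)
        for tx, wx in x.items()
        for ty, wy in y.items()
    )


def soft_cosine(x, y, S):
    """Soft Cosine Measure: x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))."""
    norm = sqrt(bilinear(x, S, x) * bilinear(y, S, y))
    return bilinear(x, S, y) / norm if norm else 0.0


# Two one-word documents with no terms in common.
x = {"car": 1.0}
y = {"automobile": 1.0}

# Illustrative term similarity (an assumed value, not a measured one).
S = {("car", "automobile"): 0.8}

soft = soft_cosine(x, y, S)    # uses the term similarity relation
plain = soft_cosine(x, y, {})  # with no off-diagonal entries, S is the identity
```

With an empty S the measure reduces to the ordinary cosine similarity, which is zero for these two documents because they share no terms; the off-diagonal entry in S is what lets the SCM register that they are related. This naive double loop is only meant to show the definition and says nothing about the complexity results discussed in the abstract.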

In our work, we prove a tighter worst-case time complexity bound of O(n^3) for the orthonormalization. We also present an algorithm for computing the similarity between documents and show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementations in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and Elasticsearch. Our results enable the deployment of the SCM in real-world information retrieval systems.

Links
MUNI/A/1145/2018, MU internal code	Name: Applied research at FI: software architectures of critical infrastructures, computer systems security, techniques for processing and visualization of big data, and augmented reality.
MUNI/A/1145/2018, MU internal code	Investor: Masaryk University, Critical Infrastructure Software Architectures, Computer Systems Security, Data Processing and Visualization Techniques, and Augmented Reality, Category A

