NOVOTNÝ, Vít. Implementation Notes for the Soft Cosine Measure. Online. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). Torino, Italy: Association for Computing Machinery, 2018, p. 1639-1642. ISBN 978-1-4503-6014-2. Available from: https://dx.doi.org/10.1145/3269206.3269317.
Basic information
Original name Implementation Notes for the Soft Cosine Measure
Authors NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition Torino, Italy, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), p. 1639-1642, 4 pp. 2018.
Publisher Association for Computing Machinery
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Italy
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/18:00101853
Organization unit Faculty of Informatics
ISBN 978-1-4503-6014-2
Doi http://dx.doi.org/10.1145/3269206.3269317
UT WoS 000455712300190
Keywords in English Vector Space Model; computational complexity; similarity measure
Tags core_A, firank_A, information retrieval, ranking, SCM, similarity search, soft cosine measure
Tags International impact, Reviewed
Changed by RNDr. Pavel Šmerk, Ph.D., učo 3880. Changed: 25/4/2022 04:56.
Abstract
The standard bag-of-words vector space model (VSM) is efficient and ubiquitous in information retrieval, but it underestimates the similarity of documents with the same meaning but different terminology. To overcome this limitation, Sidorov et al. proposed the Soft Cosine Measure (SCM), which incorporates term similarity relations. Charlet and Damnati showed that the SCM is highly effective in question answering (QA) systems. However, the orthonormalization algorithm proposed by Sidorov et al. has an impractical time complexity of O(n^4), where n is the size of the vocabulary. In this paper, we prove a tighter lower worst-case time complexity bound of O(n^3). We also present an algorithm for computing the similarity between documents, and we show that its worst-case time complexity is O(1) given realistic conditions. Lastly, we describe implementation in general-purpose vector databases such as Annoy and Faiss, and in the inverted indices of text search engines such as Apache Lucene and ElasticSearch. Our results enable the deployment of the SCM in real-world information retrieval systems.
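As a rough illustration of the measure named in the abstract, the sketch below computes the SCM of two bag-of-words vectors x and y as x^T S y / (sqrt(x^T S x) * sqrt(y^T S y)), where S is a term similarity matrix. The sparse dictionary representation, the function names, and the toy similarity value of 0.8 are assumptions made for this example. It is not the algorithm from the paper and does not reproduce its complexity results.

```python
from math import sqrt


def bilinear(x, y, sim):
    """Compute x^T S y for sparse vectors x and y (dicts: term -> weight).

    `sim` maps an unordered pair of distinct terms (a frozenset) to a
    similarity value. Identical terms have similarity 1; pairs missing
    from `sim` have similarity 0. This layout is an assumption of the sketch.
    """
    total = 0.0
    for term_x, weight_x in x.items():
        for term_y, weight_y in y.items():
            if term_x == term_y:
                s = 1.0
            else:
                s = sim.get(frozenset((term_x, term_y)), 0.0)
            total += weight_x * s * weight_y
    return total


def soft_cosine(x, y, sim):
    """Soft cosine of x and y; returns 0.0 if either vector has zero norm."""
    norm = sqrt(bilinear(x, x, sim)) * sqrt(bilinear(y, y, sim))
    return bilinear(x, y, sim) / norm if norm > 0 else 0.0


# Toy data (hypothetical): two documents that share one term and use
# two different but related terms.
sim = {frozenset(("car", "automobile")): 0.8}
doc_a = {"car": 1.0, "fast": 1.0}
doc_b = {"automobile": 1.0, "fast": 1.0}

hard = soft_cosine(doc_a, doc_b, {})   # no term relations: ordinary cosine
soft = soft_cosine(doc_a, doc_b, sim)  # with the term similarity relation
print(hard, soft)
```

With an empty similarity table the function reduces to the ordinary cosine similarity of the VSM, so the two documents score about 0.5. Adding the single term relation raises the score to about 0.9, which is the effect the abstract describes for documents with the same meaning but different terminology.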
Links
MUNI/A/1038/2017, internal MU code. Name: Involvement of Faculty of Informatics students in the international scientific community 18
Investor: Masaryk University, Category A
MUNI/A/1213/2017, internal MU code. Name: Applied research at FI: computer systems security, SW architectures of critical infrastructures, big data processing, data visualization and virtual reality
Investor: Masaryk University, Applied research at FI: computer systems security, SW architecture of critical infrastructure, big data processing, data visualization and virtual reality, Category A
TD03000295, research and development project. Name: Intelligent software for semantic document search (Acronym: ISSHD)
Investor: Technology Agency of the Czech Republic
Name 1808.09407.pdf
Uploaded/Created by Starý Novotný, V.
Uploaded/Created 30/10/2018

Properties

Address within IS
https://is.muni.cz/auth/publication/1430596/1808.09407.pdf
Address for the users outside IS
https://is.muni.cz/publication/1430596/1808.09407.pdf
Address within Manager
https://is.muni.cz/auth/publication/1430596/1808.09407.pdf?info
Address within Manager for the users outside IS
https://is.muni.cz/publication/1430596/1808.09407.pdf?info
Uploaded/Created
Tue 30/10/2018 22:27, RNDr. Vít Starý Novotný, Ph.D.

Rights

Right to read
  • anyone on the Internet
  • a concrete person RNDr. Pavel Šmerk, Ph.D., učo 3880
  • a concrete person RNDr. Vít Starý Novotný, Ph.D., učo 409729
Right to upload
 
Right to administer:
  • a concrete person RNDr. Pavel Šmerk, Ph.D., učo 3880
  • a concrete person RNDr. Vít Starý Novotný, Ph.D., učo 409729
Attributes
 

1808.09407.pdf

File type
PDF (application/pdf)
Size
700.5 KB
Hash md5
a273b4e79382e4d01e0e87780c129cdb
Uploaded/Created
Tue 30/10/2018 22:27

1808.09407.txt

Address within IS
https://is.muni.cz/auth/publication/1430596/1808.09407.txt
Address for the users outside IS
https://is.muni.cz/publication/1430596/1808.09407.txt
File type
plain text (text/plain)
Size
24 KB
Hash md5
008566e260d5e794ba91a54313baa33d
Uploaded/Created
Tue 30/10/2018 22:31