D 2017

Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines

RŮŽIČKA, Michal, Vít NOVOTNÝ, Petr SOJKA, Jan POMIKÁLEK, Radim ŘEHŮŘEK et al.

Basic information

Original name

Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines

Authors

RŮŽIČKA, Michal (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution), Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution), Jan POMIKÁLEK (203 Czech Republic) and Radim ŘEHŮŘEK (203 Czech Republic)

Edition

Vienna, Austria, CEUR Workshop Proceedings, Vol. 1923, p. 1-12, 12 pp. 2017

Publisher

Not stated

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Austria

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

RIV identification code

RIV/00216224:14330/17:00094375

Organization unit

Faculty of Informatics

ISSN

Keywords in English

vector space modelling; semantic vectors encodings; inverted-index; systems performance; document representations; Latent Semantic Analysis; doc2vec; GloVe; Elasticsearch; evaluation; performance optimization

Tags

International impact, Reviewed
Changed: 3/1/2023 15:15, RNDr. Vít Starý Novotný, Ph.D.

Abstract

In the original

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. In our recent research we proposed a novel approach to ‘vector similarity searching’ over dense semantic vector representations. This approach can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. In this paper we validate our method using varied datasets ranging from text representations and embeddings (LSA, doc2vec, GloVe) to SIFT descriptors of image data. We show how our approach handles the indexing and querying in these domains, building a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch.
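
The general idea summarized in the abstract, searching dense vectors through an inverted index, can be illustrated with a small sketch. The example below is not the encoding used in the paper: the token format (`index_value`), the `precision` and `top_m` parameters, and the `ToyVectorIndex` class are all hypothetical names chosen for illustration, and a plain in-memory dictionary stands in for a fulltext engine such as Elasticsearch. It only shows one plausible way to turn vector components into text tokens, retrieve candidates by token overlap, and rerank them by exact cosine similarity.

```python
import math
from collections import defaultdict


def encode(vector, precision=1, top_m=None):
    """Turn a dense vector into text tokens such as '2_-0.3'.

    Hypothetical encoding: feature index, underscore, rounded value.
    """
    features = list(enumerate(vector))
    if top_m is not None:
        # Keep only the top_m components with the largest magnitude.
        features = sorted(features, key=lambda f: abs(f[1]), reverse=True)[:top_m]
    # Adding 0.0 turns a rounded -0.0 into 0.0 so both map to one token.
    return [f"{i}_{round(v, precision) + 0.0}" for i, v in features]


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """In-memory stand-in for an inverted-index fulltext engine."""

    def __init__(self, precision=1):
        self.precision = precision
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.vectors = {}                 # document id -> original dense vector

    def add(self, doc_id, vector):
        self.vectors[doc_id] = vector
        for token in encode(vector, self.precision):
            self.postings[token].add(doc_id)

    def search(self, query, top_m=None, page=10, k=3):
        # Phase 1: cheap candidate retrieval by counting shared tokens.
        hits = defaultdict(int)
        for token in encode(query, self.precision, top_m):
            for doc_id in self.postings.get(token, ()):
                hits[doc_id] += 1
        candidates = sorted(hits, key=hits.get, reverse=True)[:page]
        # Phase 2: rerank the candidates by exact cosine similarity.
        ranked = sorted(
            candidates,
            key=lambda d: cosine(query, self.vectors[d]),
            reverse=True,
        )
        return ranked[:k]


index = ToyVectorIndex(precision=1)
index.add("a", [0.12, 0.50, -0.31])
index.add("b", [0.11, 0.52, -0.29])
index.add("c", [-0.70, 0.02, 0.64])

result = index.search([0.10, 0.49, -0.30], top_m=2)
print(result)
```

In this sketch, a coarser `precision` or a smaller `top_m` makes the candidate phase cheaper but less exact, while a larger `page` lets more candidates reach the exact reranking step. This is one way to picture the tunable trade-off between search performance and quality that the abstract mentions.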

Links

MUNI/A/0997/2016, internal MU code
Name: Aplikovaný výzkum na FI: vyhledávacích systémy, bezpečnost, vizualizace dat a virtuální realita.
Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, Category A
TD03000295, research and development project
Name: Inteligentní software pro sémantické hledání dokumentů (Intelligent software for semantic document search; Acronym: ISSHD)
Investor: Technology Agency of the Czech Republic