RŮŽIČKA, Michal, Vít NOVOTNÝ, Petr SOJKA, Jan POMIKÁLEK and Radim ŘEHŮŘEK. Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines. Online. In CEUR Workshop Proceedings, Vol. 1923. Vienna, Austria: [publisher not stated], 2017, p. 1-12. ISSN 1613-0073.
Basic information
Original name Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines
Authors RŮŽIČKA, Michal (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution), Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution), Jan POMIKÁLEK (203 Czech Republic) and Radim ŘEHŮŘEK (203 Czech Republic).
Edition Vienna, Austria, CEUR Workshop Proceedings, Vol. 1923, p. 1-12, 12 pp. 2017.
Publisher Not stated
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Austria
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
WWW Workshop homepage; Proceedings volume landing page; Conference homepage; Full text
RIV identification code RIV/00216224:14330/17:00094375
Organization unit Faculty of Informatics
ISSN 1613-0073
Keywords in English vector space modelling; semantic vectors encodings; inverted-index; systems performance; document representations; Latent Semantic Analysis; doc2vec; GloVe; Elasticsearch; evaluation; performance optimization
Tags International impact, Reviewed
Changed by RNDr. Vít Starý Novotný, Ph.D., university ID (učo) 409729. Changed: 3/1/2023 15:15.
Abstract
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. In our recent research we proposed a novel approach to ‘vector similarity searching’ over dense semantic vector representations. This approach can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. In this paper we validate our method using varied datasets ranging from text representations and embeddings (LSA, doc2vec, GloVe) to SIFT descriptors of image data. We show how our approach handles the indexing and querying in these domains, building a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch.
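The abstract does not spell out how dense vectors are made searchable by an inverted index, so the following is only a minimal sketch under stated assumptions: each vector feature is rounded and turned into a string token, the tokens are indexed like ordinary words, and the shortlisted candidates are re-ranked by exact cosine similarity. The token format `f{i}_{value}`, the function names, and the in-memory dictionary standing in for a fulltext engine such as Elasticsearch are all hypothetical. The `precision`, `top_k` and `candidates` parameters illustrate the kind of tunable trade-off between search speed and quality that the abstract mentions.

```python
import math
from collections import Counter, defaultdict


def encode_vector(vec, precision=1, top_k=None):
    """Encode a dense vector as string tokens (hypothetical token format).

    Each feature becomes 'f<index>_<rounded value>'. If top_k is given,
    only the top_k features with the largest magnitude are kept.
    """
    features = list(enumerate(vec))
    if top_k is not None:
        features = sorted(features, key=lambda p: abs(p[1]), reverse=True)[:top_k]
    tokens = []
    for i, value in features:
        rounded = round(value, precision) + 0.0  # adding 0.0 turns -0.0 into 0.0
        tokens.append(f"f{i}_{rounded:.{precision}f}")
    return tokens


def build_index(docs, precision=1):
    """Toy inverted index (token -> set of document ids).

    Stands in for a real fulltext engine; docs maps id -> vector.
    """
    index = defaultdict(set)
    for doc_id, vec in docs.items():
        for token in encode_vector(vec, precision):
            index[token].add(doc_id)
    return index


def cosine(a, b):
    """Exact cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, docs, query, precision=1, top_k=None, candidates=10):
    """Shortlist documents by token overlap, then re-rank by exact cosine."""
    hits = Counter()
    for token in encode_vector(query, precision, top_k):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    shortlist = [doc_id for doc_id, _ in hits.most_common(candidates)]
    return sorted(shortlist, key=lambda d: cosine(query, docs[d]), reverse=True)


docs = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.1, 0.9, 0.0],
    "c": [0.88, 0.12, 0.01],
}
index = build_index(docs, precision=1)
results = search(index, docs, [0.9, 0.1, 0.0], precision=1, candidates=2)
print(results)
```

In this sketch, a coarser `precision` or a smaller `top_k` makes more vectors share tokens and shortens queries, while a larger `candidates` shortlist costs more exact comparisons but is less likely to miss a true nearest neighbour.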
Links
MUNI/A/0997/2016, MU internal code. Name: Applied research at FI: search systems, security, data visualization and virtual reality.
Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, Category A
TD03000295, research and development project. Name: Intelligent software for semantic document search (original name: Inteligentní software pro sémantické hledání dokumentů; Acronym: ISSHD)
Investor: Technology Agency of the Czech Republic