RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. Online. In Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017. Vancouver, Canada: Association for Computational Linguistics, ACL, 2017, p. 81-90. ISBN 978-1-945626-62-3. Available from: https://dx.doi.org/10.18653/v1/W17-2611.
Basic information
Original name Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
Authors RYGL, Jan (203 Czech Republic), Jan POMIKÁLEK (203 Czech Republic), Radim ŘEHŮŘEK (203 Czech Republic), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution) and Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution).
Edition Vancouver, Canada, Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017, p. 81-90, 10 pp. 2017.
Publisher Association for Computational Linguistics, ACL
Other information
Original language English
Type of outcome Proceedings paper
Field of Study 10201 Computer sciences, information science, bioinformatics
Country of publisher Czech Republic
Confidentiality degree is not subject to a state or trade secret
Publication form electronic version available online
RIV identification code RIV/00216224:14330/17:00094366
Organization unit Faculty of Informatics
ISBN 978-1-945626-62-3
Doi http://dx.doi.org/10.18653/v1/W17-2611
Keywords (in Czech) fulltextové vyhledávání; podobnostní hledání; vektorové prostory; vektorové reprezentace
Keywords in English full-text search; similarity search; vector space; embeddings
Tags acl, gensim, repl4nlp, scaletext
Tags International impact, Reviewed
Changed by doc. RNDr. Petr Sojka, Ph.D., učo 2378. Changed: 19/9/2019 14:14.
Abstract
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.
Links
MUNI/A/0997/2016, internal MU code
Name: Applied research at FI: search systems, security, data visualization and virtual reality
Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, Category A
TD03000295, research and development project
Name: Inteligentní software pro sémantické hledání dokumentů [Intelligent software for semantic document search] (Acronym: ISSHD)
Investor: Technology Agency of the Czech Republic