Detailed Information on Publication Record
2017
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ et al.
Basic information
Original name
Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
Authors
RYGL, Jan (203 Czech Republic), Jan POMIKÁLEK (203 Czech Republic), Radim ŘEHŮŘEK (203 Czech Republic), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution) and Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution)
Edition
Vancouver, Canada, Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017, p. 81-90, 10 pp. 2017
Publisher
Association for Computational Linguistics, ACL
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10201 Computer sciences, information science, bioinformatics
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
electronic version available online
RIV identification code
RIV/00216224:14330/17:00094366
Organization unit
Faculty of Informatics
ISBN
978-1-945626-62-3
Keywords (in Czech)
fulltextové vyhledávání; podobnostní hledání; vektorové prostory; vektorové reprezentace
Keywords in English
full-text search; similarity search; vector space; embeddings
Tags
International impact, Reviewed
Changed: 19/9/2019 14:14, doc. RNDr. Petr Sojka, Ph.D.
Abstract
In the original language
Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.
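The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token scheme (`f{index}_{rounded value}`), the function names, and the `precision` and `candidates` parameters are assumptions made for illustration, and a plain in-memory inverted index stands in for a fulltext engine such as Elasticsearch. Each dense vector is rounded feature by feature into string tokens, candidates are retrieved by token overlap, and the shortlist is reranked by exact cosine similarity. Coarser rounding and a smaller shortlist trade quality for speed.

```python
import math
from collections import defaultdict


def encode(vector, precision=1):
    """Turn a dense vector into one string token per feature (hypothetical scheme)."""
    tokens = []
    for i, x in enumerate(vector):
        value = round(x, precision) + 0.0  # adding 0.0 normalizes -0.0 to 0.0
        tokens.append(f"f{i}_{value:.{precision}f}")
    return tokens


def build_index(docs, precision=1):
    """Build an inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, vector in docs.items():
        for token in encode(vector, precision):
            index[token].add(doc_id)
    return index


def cosine(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, index, precision=1, candidates=10, top_k=3):
    """Two-phase search: token-overlap shortlist, then exact cosine rerank."""
    hits = defaultdict(int)
    for token in encode(query, precision):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    # Phase 1: keep the documents sharing the most tokens with the query.
    shortlist = sorted(hits, key=lambda d: (-hits[d], d))[:candidates]
    # Phase 2: rerank the shortlist by exact similarity on the original vectors.
    ranked = sorted(shortlist, key=lambda d: -cosine(query, docs[d]))
    return ranked[:top_k]


# Toy data (made up for this sketch).
docs = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.1, 0.9, 0.0],
    "c": [0.88, 0.12, 0.03],
}
index = build_index(docs)
query = [0.9, 0.1, 0.0]
print(search(query, docs, index, candidates=2, top_k=2))  # ['a', 'c']
```

In this sketch, raising `precision` makes tokens more selective (fewer but closer candidates), while raising `candidates` widens the shortlist at the cost of more exact similarity computations.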
Links
MUNI/A/0997/2016, internal MU code
TD03000295, research and development project