Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

D 2017

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ et al.

Basic information

Original name

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

Authors

RYGL, Jan (203 Czech Republic), Jan POMIKÁLEK (203 Czech Republic), Radim ŘEHŮŘEK (203 Czech Republic), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution) and Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution)

Edition

Vancouver, Canada, Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017, p. 81-90, 10 pp. 2017

Publisher

Association for Computational Linguistics, ACL

Other information

Language

English

Type of outcome

Article in proceedings

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

Preprint Article

RIV identification code

RIV/00216224:14330/17:00094366

Organization unit

Faculty of Informatics

ISBN

978-1-945626-62-3

DOI

http://dx.doi.org/10.18653/v1/W17-2611

Keywords (in Czech)

fulltextové vyhledávání; podobnostní hledání; vektorové prostory; vektorové reprezentace

Keywords in English

full-text search; similarity search; vector space; embeddings

Abstract

In the original

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.
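The idea of indexing dense vectors in an inverted-index engine can be illustrated with a minimal sketch. It assumes one plausible encoding, not necessarily the one used in the paper: each vector component is rounded to a fixed precision and turned into a string token, so that similar vectors share tokens that a fulltext engine could match. The function names `encode_vector` and `token_overlap`, the token format `f<index>_<value>`, and the `precision` and `keep` parameters are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def encode_vector(vec: Sequence[float], precision: int = 1,
                  keep: Optional[int] = None) -> List[str]:
    """Encode a dense vector as string tokens (hypothetical scheme).

    precision: number of decimal places kept per component.
    keep: if given, only the `keep` components with the largest
          absolute value are encoded (fewer tokens, coarser match).
    """
    indexed = list(enumerate(vec))
    if keep is not None:
        # Keep the strongest components, then restore index order.
        indexed = sorted(indexed, key=lambda p: abs(p[1]), reverse=True)[:keep]
        indexed.sort()
    tokens = []
    for i, value in indexed:
        # Adding 0.0 normalises -0.0 to 0.0 so both map to one token.
        rounded = round(value, precision) + 0.0
        tokens.append(f"f{i}_{rounded:+.{precision}f}")
    return tokens


def token_overlap(a: Sequence[str], b: Sequence[str]) -> int:
    """Count shared tokens, a crude stand-in for a fulltext match score."""
    return len(set(a) & set(b))


doc_tokens = encode_vector([0.12, -0.48, 0.03])
query_tokens = encode_vector([0.14, -0.52, 0.30])
print(doc_tokens)
print(token_overlap(doc_tokens, query_tokens))
```

In this sketch, lowering `precision` or `keep` produces fewer distinct tokens, which would make matching cheaper but coarser. This is only one way to picture the tunable trade-off between search performance and quality that the abstract mentions.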

Links

MUNI/A/0997/2016, MU internal code

Name: Aplikovaný výzkum na FI: vyhledávacích systémy, bezpečnost, vizualizace dat a virtuální realita.

Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, Category A

TD03000295, research and development project

Name: Inteligentní software pro sémantické hledání dokumentů (Intelligent software for semantic document search; Acronym: ISSHD)

Investor: Technology Agency of the Czech Republic

Cite

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. Online. In Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017. Vancouver, Canada: Association for Computational Linguistics, ACL, 2017, p. 81-90. ISBN 978-1-945626-62-3. Available from: https://dx.doi.org/10.18653/v1/W17-2611.

@inproceedings{1386510,
   author = {Rygl, Jan and Pomikálek, Jan and Řehůřek, Radim and Růžička, Michal and Novotný, Vít and Sojka, Petr},
   address = {Vancouver, Canada},
   booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017},
   doi = {10.18653/v1/W17-2611},
   keywords = {full-text search; similarity search; vector space; embeddings},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Vancouver, Canada},
   isbn = {978-1-945626-62-3},
   pages = {81-90},
   publisher = {Association for Computational Linguistics, ACL},
   title = {Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines},
   url = {https://arxiv.org/abs/1706.00957},
   year = {2017}
}

TY  - CONF
ID  - 1386510
AU  - Rygl, Jan
AU  - Pomikálek, Jan
AU  - Řehůřek, Radim
AU  - Růžička, Michal
AU  - Novotný, Vít
AU  - Sojka, Petr
PY  - 2017
TI  - Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
PB  - Association for Computational Linguistics, ACL
CY  - Vancouver, Canada
SN  - 9781945626623
KW  - full-text search
KW  - similarity search
KW  - vector space
KW  - embeddings
UR  - https://arxiv.org/abs/1706.00957
L2  - https://doi.org/10.18653/v1/W17-2611
N2  - Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.
ER  -

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. Online. In \textit{Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017}. Vancouver, Canada: Association for Computational Linguistics, ACL, 2017, p.~81-90. ISBN~978-1-945626-62-3. Available from: https://dx.doi.org/10.18653/v1/W17-2611.
