D 2017

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ et al.

Basic information

Original name

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

Authors

RYGL, Jan (203 Czech Republic), Jan POMIKÁLEK (203 Czech Republic), Radim ŘEHŮŘEK (203 Czech Republic), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution) and Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution)

Edition

Vancouver, Canada, Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017, p. 81-90, 10 pp. 2017

Publisher

Association for Computational Linguistics, ACL

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

References:

RIV identification code

RIV/00216224:14330/17:00094366

Organization unit

Faculty of Informatics

ISBN

978-1-945626-62-3

Keywords (in Czech)

fulltextové vyhledávání; podobnostní hledání; vektorové prostory; vektorové reprezentace

Keywords in English

full-text search; similarity search; vector space; embeddings

Tags

International impact, Reviewed
Changed: 19/9/2019 14:14, doc. RNDr. Petr Sojka, Ph.D.

Abstract

In the original

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.

Links

MUNI/A/0997/2016, internal MU code
Name: Aplikovaný výzkum na FI: vyhledávacích systémy, bezpečnost, vizualizace dat a virtuální realita.
Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, Category A
TD03000295, research and development project
Name: Inteligentní software pro sémantické hledání dokumentů (Intelligent software for semantic document search; Acronym: ISSHD)
Investor: Technology Agency of the Czech Republic