Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

D 2017

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ et al.

Basic information

Original title

Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines

Authors

RYGL, Jan (203 Czech Republic), Jan POMIKÁLEK (203 Czech Republic), Radim ŘEHŮŘEK (203 Czech Republic), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution), Vít NOVOTNÝ (203 Czech Republic, belonging to the institution) and Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution)

Edition

Vancouver, Canada, Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017, p. 81-90, 10 pp. 2017

Publisher

Association for Computational Linguistics, ACL

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

Links

Preprint Article

RIV identification code

RIV/00216224:14330/17:00094366

Organization unit

Faculty of Informatics

ISBN

978-1-945626-62-3

DOI

http://dx.doi.org/10.18653/v1/W17-2611

Keywords (in Czech)

fulltextové vyhledávání; podobnostní hledání; vektorové prostory; vektorové reprezentace

Keywords in English

full-text search; similarity search; vector space; embeddings

Tags

acl, gensim, repl4nlp, scaletext

Flags

International impact, Reviewed

Changed: 19 Sep 2019 14:14, doc. RNDr. Petr Sojka, Ph.D.

Abstract

In the original language

Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.

Related projects

MUNI/A/0997/2016, internal MU code

Name: Applied research at FI: search systems, security, data visualization and virtual reality

Investor: Masaryk University, Applied research at FI: search systems, security, data visualization and virtual reality, until 2020_Category A - Specific research - Student research projects

TD03000295, R&D project

Name: Intelligent software for semantic document search (Acronym: ISSHD)

Investor: Technology Agency of the Czech Republic, Intelligent software for semantic document search

Cite

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. Online. In Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017. Vancouver, Canada: Association for Computational Linguistics, ACL, 2017, p. 81-90. ISBN 978-1-945626-62-3. Available from: https://dx.doi.org/10.18653/v1/W17-2611.

@inproceedings{1386510,
   author = {Rygl, Jan and Pomikálek, Jan and Řehůřek, Radim and Růžička, Michal and Novotný, Vít and Sojka, Petr},
   address = {Vancouver, Canada},
   booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017},
   doi = {10.18653/v1/W17-2611},
   keywords = {full-text search; similarity search; vector space; embeddings},
   howpublished = {electronic version available online},
   language = {eng},
   location = {Vancouver, Canada},
   isbn = {978-1-945626-62-3},
   pages = {81--90},
   publisher = {Association for Computational Linguistics, ACL},
   title = {Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines},
   url = {https://arxiv.org/abs/1706.00957},
   year = {2017}
}

TY  - CONF
ID  - 1386510
AU  - Rygl, Jan
AU  - Pomikálek, Jan
AU  - Řehůřek, Radim
AU  - Růžička, Michal
AU  - Novotný, Vít
AU  - Sojka, Petr
PY  - 2017
TI  - Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines
PB  - Association for Computational Linguistics, ACL
CY  - Vancouver, Canada
SN  - 9781945626623
KW  - full-text search
KW  - similarity search
KW  - vector space
KW  - embeddings
UR  - https://arxiv.org/abs/1706.00957
L2  - https://doi.org/10.18653/v1/W17-2611
N2  - Vector representations and vector space modeling (VSM) play a central role in modern machine learning. We propose a novel approach to ‘vector similarity searching’ over dense semantic representations of words and documents that can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. We show that this approach allows the indexing and querying of dense vectors in text domains. This opens up exciting avenues for major efficiency gains, along with simpler deployment, scaling and monitoring. The end result is a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch. We empirically demonstrate its querying performance and quality by applying this solution to the task of semantic searching over a dense vector representation of the entire English Wikipedia.
ER  -

RYGL, Jan, Jan POMIKÁLEK, Radim ŘEHŮŘEK, Michal RŮŽIČKA, Vít NOVOTNÝ and Petr SOJKA. Semantic Vector Encoding and Similarity Search Using Fulltext Search Engines. Online. In \textit{Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP 2017 c/o ACL 2017}. Vancouver, Canada: Association for Computational Linguistics, ACL, 2017, p.~81--90. ISBN~978-1-945626-62-3. Available from: https://dx.doi.org/10.18653/v1/W17-2611.
