Scaling Learned Metric Index to 100M Datasets

D 2025

PROCHÁZKA, David; Terézia SLANINÁKOVÁ; Jozef ČERŇANSKÝ; Jaroslav OĽHA; Matej ANTOL et al.

Basic information

Original title

Scaling Learned Metric Index to 100M Datasets

Authors

PROCHÁZKA, David (203 Czech Republic, belonging to the institution); Terézia SLANINÁKOVÁ (703 Slovakia, belonging to the institution); Jozef ČERŇANSKÝ (703 Slovakia, belonging to the institution); Jaroslav OĽHA (703 Slovakia, belonging to the institution); Matej ANTOL (703 Slovakia, belonging to the institution) and Vlastislav DOHNAL (203 Czech Republic, guarantor, belonging to the institution)

Edition

BERLIN, SIMILARITY SEARCH AND APPLICATIONS, SISAP 2024, p. 266-273, 8 pp. 2025

Publisher

Springer

Other information

Language

English

Type of outcome

Proceedings paper

Field of study

10200 1.2 Computer and information sciences

Country of publisher

Germany

Confidentiality

is not subject to a state or trade secret

Publication form

printed version "print"

Impact factor

Impact factor: 0.402 in 2005

Organization unit

Faculty of Informatics

ISBN

978-3-031-75822-5

ISSN

DOI

http://dx.doi.org/10.1007/978-3-031-75823-2_22

UT WoS

001422992900022

Keywords in English

learned metric index;high-dimensional data;memory efficiency;on-disk index;approximate nearest neighbor search;similarity search

Tags

approximate search, CODA Research Group, DISA, learned indexing, LMI, machine learning, nearest-neighbors query, rivok, similarity search

Flags

International impact, Reviewed

Changed: 25 March 2025 16:12, Mgr. Eva Špillingová

Abstract

In the original language

Learned indexing of high-dimensional data is an indexing approach that is still in the process of proving its viability; the Learned Metric Index (LMI) stands as one of the pioneering methods in this regard. The earlier implementation of LMI [Slanináková et al., SISAP 2023] primarily served as an experimental prototype, operating under unrealistic assumptions, such as the availability of unlimited main memory or unbounded index construction time. Recently, however, LMI made the leap towards practical applicability on real-world datasets when it was successfully deployed to efficiently index 214 million protein structures for near-instantaneous retrieval [Procházka et al., Nucleic Acids Research 2024]. This paper details the key improvements that enabled this transition, including the introduction of parallel query processing (with the possibility of GPU acceleration), adaptive memory usage, pre-construction of memory buckets for contiguous access, a shift from k-means to spherical k-means clustering, and faster index construction through fewer epochs and the use of smaller training samples. LMI is now capable of handling 100M datasets and supports both in-memory and on-disk indexing, marking several important steps towards the practical viability of AI-enhanced indexes for high-dimensional complex data in real-world settings.
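
Three of the improvements named in the abstract (spherical k-means, training on a smaller sample with fewer epochs, and pre-constructed buckets laid out contiguously) can be illustrated with a small sketch. This is not the authors' implementation: the function names (`spherical_kmeans`, `build_buckets`), the parameters, and the flat `order`/`offsets` layout are assumptions chosen for illustration, and the code uses plain Python lists rather than whatever storage LMI actually uses.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_centroid(v, centroids):
    """Index of the centroid with the highest cosine similarity to unit vector v."""
    return max(range(len(centroids)), key=lambda j: dot(v, centroids[j]))


def spherical_kmeans(vectors, k, epochs=5, sample_size=None, seed=0):
    """Illustrative spherical k-means; trains on a random sample if sample_size is set."""
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    if sample_size is not None and sample_size < len(data):
        data = rng.sample(data, sample_size)  # smaller training sample
    centroids = rng.sample(data, k)
    for _ in range(epochs):  # a small, fixed number of epochs
        sums = [[0.0] * len(data[0]) for _ in range(k)]
        counts = [0] * k
        for v in data:
            j = nearest_centroid(v, centroids)
            counts[j] += 1
            for d, x in enumerate(v):
                sums[j][d] += x
        # Re-project each mean onto the unit sphere; keep the old centroid if a cluster is empty.
        centroids = [normalize(sums[j]) if counts[j] else centroids[j] for j in range(k)]
    return centroids


def build_buckets(vectors, centroids):
    """Group object ids by bucket into one flat list plus per-bucket offsets."""
    assignment = [nearest_centroid(normalize(v), centroids) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: assignment[i])
    offsets = [0] * (len(centroids) + 1)
    for j in assignment:
        offsets[j + 1] += 1
    for j in range(len(centroids)):
        offsets[j + 1] += offsets[j]
    return order, offsets
```

In this layout, the members of bucket `j` are `order[offsets[j]:offsets[j + 1]]`, so answering a query that visits a bucket reads one contiguous slice instead of following scattered references. The difference from ordinary k-means is that both the data and the centroids stay on the unit sphere and assignment uses cosine similarity rather than Euclidean distance.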

Links

GF23-07040K, research and development project

Name: Learned indexes for similarity search (Naučené indexy pro podobností hledání)

Investor: Czech Science Foundation (Grantová agentura ČR), Learned indexes for similarity search, Lead Agency

MUNI/A/1590/2023, internal MU code

Name: Use of artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data (Využití technik umělé inteligence pro zpracování dat, komplexní analýzy a vizualizaci rozsáhlých dat)

Investor: Masaryk University, Use of artificial intelligence techniques for data processing, complex analysis, and visualization of large-scale data

90254, large research infrastructure

Name: e-INFRA CZ II

90255, large research infrastructure

Name: ELIXIR CZ III

Cite

PROCHÁZKA, David; Terézia SLANINÁKOVÁ; Jozef ČERŇANSKÝ; Jaroslav OĽHA; Matej ANTOL and Vlastislav DOHNAL. Scaling Learned Metric Index to 100M Datasets. In Edgar Chávez, Benjamin Kimia, Jakub Lokoč, Marco Patella, Jan Sedmidubsky. SIMILARITY SEARCH AND APPLICATIONS, SISAP 2024. BERLIN: Springer, 2025, pp. 266-273. ISBN 978-3-031-75822-5. Available from: https://dx.doi.org/10.1007/978-3-031-75823-2_22.

@inproceedings{2428378,
   author = {Procházka, David and Slanináková, Terézia and Čerňanský, Jozef and Oľha, Jaroslav and Antol, Matej and Dohnal, Vlastislav},
   address = {BERLIN},
   booktitle = {SIMILARITY SEARCH AND APPLICATIONS, SISAP 2024},
   doi = {10.1007/978-3-031-75823-2_22},
   editor = {Edgar Chávez and Benjamin Kimia and Jakub Lokoč and Marco Patella and Jan Sedmidubsky},
   keywords = {learned metric index;high-dimensional data;memory efficiency;on-disk index;approximate nearest neighbor search;similarity search},
   howpublished = {printed version "print"},
   language = {eng},
   location = {BERLIN},
   isbn = {978-3-031-75822-5},
   pages = {266-273},
   publisher = {Springer},
   title = {Scaling Learned Metric Index to 100M Datasets},
   year = {2025}
}

TY  - CONF
ID  - 2428378
AU  - Procházka, David
AU  - Slanináková, Terézia
AU  - Čerňanský, Jozef
AU  - Oľha, Jaroslav
AU  - Antol, Matej
AU  - Dohnal, Vlastislav
PY  - 2025
TI  - Scaling Learned Metric Index to 100M Datasets
PB  - Springer
CY  - BERLIN
SN  - 9783031758225
KW  - learned metric index;high-dimensional data;memory efficiency;on-disk index;approximate nearest neighbor search;similarity search
N2  - Learned indexing of high-dimensional data is an indexing approach that is still in the process of proving its viability; the Learned Metric Index (LMI) stands as one of the pioneering methods in this regard. The earlier implementation of LMI [Slanináková et al., SISAP 2023] primarily served as an experimental prototype, operating under unrealistic assumptions, such as the availability of unlimited main memory or unbounded index construction time. Recently, however, LMI made the leap towards practical applicability on real-world datasets when it was successfully deployed to efficiently index 214 million protein structures for near-instantaneous retrieval [Procházka et al., Nucleic Acids Research 2024]. This paper details the key improvements that enabled this transition, including the introduction of parallel query processing (with the possibility of GPU acceleration), adaptive memory usage, pre-construction of memory buckets for contiguous access, a shift from k-means to spherical k-means clustering, and faster index construction through fewer epochs and the use of smaller training samples. LMI is now capable of handling 100M datasets and supports both in-memory and on-disk indexing, marking several important steps towards the practical viability of AI-enhanced indexes for high-dimensional complex data in real-world settings.
ER  -

PROCHÁZKA, David; Terézia SLANINÁKOVÁ; Jozef ČERŇANSKÝ; Jaroslav OĽHA; Matej ANTOL and Vlastislav DOHNAL. Scaling Learned Metric Index to 100M Datasets. In Edgar Chávez, Benjamin Kimia, Jakub Lokoč, Marco Patella, Jan Sedmidubsky. \textit{SIMILARITY SEARCH AND APPLICATIONS, SISAP 2024}. BERLIN: Springer, 2025, pp.~266-273. ISBN~978-3-031-75822-5. Available from: https://dx.doi.org/10.1007/978-3-031-75823-2\_{}22.
