Scaling Learned Metric Index to 100M Datasets

PROCHÁZKA, David, Terézia SLANINÁKOVÁ, Jozef ČERŇANSKÝ, Jaroslav OĽHA, Matej ANTOL and Vlastislav DOHNAL. Scaling Learned Metric Index to 100M Datasets. Online. In 17th International Conference on Similarity Search and Applications (SISAP 2024). Springer, 2024, 8 pp.

Basic information
Original name	Scaling Learned Metric Index to 100M Datasets
Authors	PROCHÁZKA, David, Terézia SLANINÁKOVÁ, Jozef ČERŇANSKÝ, Jaroslav OĽHA, Matej ANTOL and Vlastislav DOHNAL.
Edition	17th International Conference on Similarity Search and Applications (SISAP 2024), 8 pp. 2024.
Publisher	Springer

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10200 1.2 Computer and information sciences
Confidentiality degree	is not subject to a state or trade secret
Publication form	electronic version available online
Organization unit	Faculty of Informatics
Keywords in English	learned metric index;high-dimensional data;memory efficiency;on-disk index;approximate nearest neighbor search;similarity search
Tags	learned indexing, LMI
Tags	International impact, Reviewed
Changed by	RNDr. Matej Antol, Ph.D., učo 325040. Changed: 9/9/2024 17:44.

Abstract

Learned indexing of high-dimensional data is an approach that is still proving its viability, and the Learned Metric Index (LMI) is one of the pioneering methods in this area. The earlier implementation of LMI [Slanináková et al., SISAP 2023] served primarily as an experimental prototype and operated under unrealistic assumptions, such as unlimited main memory or unbounded index construction time. Recently, however, LMI made the leap towards practical applicability on real-world datasets when it was successfully deployed to index 214 million protein structures for near-instantaneous retrieval [Procházka et al., Nucleic Acids Research 2024]. This paper details the key improvements that enabled this transition: parallel query processing (with the possibility of GPU acceleration), adaptive memory usage, pre-construction of memory buckets for contiguous access, a shift from k-means to spherical k-means clustering, and faster index construction through fewer epochs and smaller training samples. LMI can now handle 100M-scale datasets and supports both in-memory and on-disk indexing, marking several important steps towards the practical viability of AI-enhanced indexes for high-dimensional complex data in real-world settings.
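
The shift from k-means to spherical k-means mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation and this record does not describe how LMI performs the clustering; the function name `spherical_kmeans` and its parameters are hypothetical. The sketch only shows the textbook idea: vectors are projected onto the unit sphere, points are assigned by cosine similarity, and centroids are re-normalized after each update.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=10, seed=0):
    """Generic spherical k-means sketch (hypothetical helper, not LMI code).

    Assumes X is a 2-D float array with no all-zero rows.
    """
    rng = np.random.default_rng(seed)
    # Project every vector onto the unit sphere so only direction matters.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    # Initialize centroids from k distinct data points.
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: highest cosine similarity (dot product of unit vectors).
        labels = (Xn @ centroids.T).argmax(axis=1)
        # Update step: sum the members, then re-normalize to unit length.
        for j in range(k):
            members = Xn[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                centroids[j] = total / length
    # Final assignment against the final centroids.
    labels = (Xn @ centroids.T).argmax(axis=1)
    return centroids, labels
```

Because the input is normalized first, the resulting partition depends only on vector directions, not magnitudes, which is the main behavioral difference from ordinary k-means with Euclidean distance.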

Links
GF23-07040K, research and development project	Name: Naučené indexy pro podobností hledání
GF23-07040K, research and development project	Investor: Czech Science Foundation, Learned Indexing for Similarity Searching, Lead Agency
90254, large research infrastructures	Name: e-INFRA CZ II
