Continually Learned Index for Dynamic Approximate Nearest Neighbor Search

Jakubík, Michal

Master's thesis

Bc. Michal Jakubík

Annotation

Modern applications increasingly rely on high-dimensional vector embeddings to represent data modalities such as text, images, or protein structures. Efficient search over such data is addressed by similarity search, in which vector indexes such as HNSW or IVF-PQ perform approximate nearest neighbor search (Approximate Nearest Neighbor, ANN …

Abstract

Modern applications increasingly rely on high-dimensional vector embeddings to represent data modalities such as text, images, or protein structures. Efficient retrieval of such data is addressed by similarity search, where vector indexes such as HNSW or IVF-PQ perform approximate nearest neighbor (ANN) search to find the most similar vectors to a given query. Learned indexes represent an alternative …

Keywords

learned indexing, dynamic index, approximate nearest neighbors, continual learning, embeddings

Thesis assignment

Learned index structures represent a promising approach to approximate nearest neighbor search in high-dimensional spaces, replacing traditional tree-based or hash-based indexing with neural network classifiers that route queries to relevant buckets. While static learned indices have demonstrated competitive retrieval performance, they fundamentally assume a fixed dataset, an assumption that rarely holds in real-world applications where data is continuously inserted and deleted. Extending learned indices with dynamic update capabilities introduces substantial challenges: the underlying classifier must grow to accommodate new buckets while maintaining balanced partitioning, and the model must be fine-tuned after structural changes without catastrophically forgetting previously learned routing decisions. Furthermore, as the data distribution evolves over time, whether gradually through incremental insertions or abruptly through distribution shift, the index may degrade to a point where full retraining becomes unavoidable. While recent work has proposed initial dynamic extensions of learned indices, a systematic and modular investigation of the key design dimensions (how to maintain representative training data, how to restructure buckets efficiently, how to adapt the model without forgetting, and how to detect the need for retraining) has not yet been conducted.
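
As a rough illustration of the routing idea described above, the following Python sketch partitions vectors into buckets and scans only the buckets that a router ranks highest for a query. It is a minimal sketch under stated assumptions, not the thesis implementation: the class ToyLearnedIndex, its methods, and the n_probe parameter are hypothetical names, and the router is a stand-in that scores buckets by negative squared distance to a fixed centroid, where a learned index would use a trained neural classifier.

```python
import math


def softmax(scores):
    """Turn raw bucket scores into probability-like values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyLearnedIndex:
    """Buckets plus a router that scores every bucket for a vector.

    ASSUMPTION: the router is a stand-in (negative squared distance to a
    fixed centroid); a learned index would use a trained classifier here.
    """

    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]
        self.buckets = [[] for _ in centroids]

    def route(self, vector):
        # Rank bucket ids from most to least likely for this vector.
        probs = softmax([-squared_distance(vector, c) for c in self.centroids])
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        return order, probs

    def insert(self, vector):
        # Store the vector in the bucket the router ranks first.
        order, _ = self.route(vector)
        self.buckets[order[0]].append(list(vector))
        return order[0]

    def search(self, query, k=1, n_probe=1):
        # Scan only the n_probe most likely buckets, then rank exactly.
        order, _ = self.route(query)
        candidates = [v for b in order[:n_probe] for v in self.buckets[b]]
        candidates.sort(key=lambda v: squared_distance(query, v))
        return candidates[:k]


index = ToyLearnedIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
for v in [[0.1, 0.2], [0.5, 0.4], [9.8, 10.1], [10.3, 9.7]]:
    index.insert(v)
nearest = index.search([9.9, 10.0], k=1, n_probe=1)
```

Raising n_probe scans more buckets per query, which trades latency for recall. The search is approximate because a true neighbor stored in a bucket that was not probed is never seen.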

The student will design and implement a Continually Learned Index that supports insertion and deletion operations with automatic classifier expansion and bucket splitting to maintain index balance. The experimental work will be organized around four independent research questions, each isolating a critical design dimension: first, the comparison of replay buffer population strategies, including random sampling, herding, and influence functions; second, the evaluation of reclustering approaches such as full reclustering, iterative splitting, and DeDrift applied to selected subsets of buckets; third, the comparison of fine-tuning loss functions, including standard cross-entropy, knowledge distillation, elastic weight consolidation, and synaptic intelligence; and fourth, the analysis of architectural adaptation strategies during fine-tuning, covering various combinations of freezing and resetting classifier and backbone weights. The best-performing technique from each dimension will then be combined into a single configuration. Additionally, the student will investigate the detection of retraining necessity, examining whether metrics such as the number of added buckets and model prediction confidence on newly inserted vectors can reliably signal when full retraining from scratch is required, both under gradual performance degradation and abrupt distribution shift. The resulting dynamic learned index will be evaluated and compared against existing baselines, including the static Learned Metric Index and other dynamic learned index implementations, in terms of retrieval accuracy, query and update latency, and memory consumption.
The student will then conduct large-scale benchmarking experiments on up to 100 million vectors from subsets of the LAION-5B dataset, a subset of the Deep1B dataset, and the llama-128-ip attention and imagenet-align-640-normalized text-to-image datasets from the VIBE benchmark, across static and several dynamic scenarios, including explicit simulation of distribution shifts. The source code created by the student will be publicly available in the IS MU archive under the GNU LGPL license.
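
The random-sampling replay buffer named in the first research question could, for example, be maintained with reservoir sampling, which keeps a uniform random sample of everything inserted so far in a fixed amount of memory. The sketch below is an assumption about one possible realization, not the method used in the thesis; ReservoirReplayBuffer and its fields are hypothetical names.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size uniform sample of a stream of (vector, bucket label) pairs.

    ASSUMPTION: reservoir sampling (Algorithm R) is used here as one way to
    realize "random sampling"; the thesis may populate its buffer differently.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of pairs offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, vector, bucket_label):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not full yet: always keep the new pair.
            self.items.append((vector, bucket_label))
            return
        # Buffer full: keep the new pair with probability capacity / seen,
        # overwriting a uniformly chosen stored pair.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = (vector, bucket_label)

    def sample(self, batch_size):
        # Draw a replay batch without replacement for fine-tuning.
        return self.rng.sample(self.items, min(batch_size, len(self.items)))


buffer = ReservoirReplayBuffer(capacity=3)
for i in range(100):
    buffer.add([float(i)], i % 4)
batch = buffer.sample(2)
```

During fine-tuning after a bucket split, a batch drawn from such a buffer would be mixed with the newly inserted vectors so that older routing decisions keep contributing to the loss.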

Administrative information

Thesis checked: 20. 5. 2026 10:55, RNDr. David Procházka, učo 485104

Assigned/changed: 18. 6. 2026 08:16, Miroslava Tomíčková, učo 114718
Record created: 27. 4. 2026 12:14, Mgr. Lenka Kubová, učo 247849
Publish from: 19. 5. 2026 12:35, Miroslava Tomíčková, učo 114718
Thesis received: 19. 5. 2026 12:35, Miroslava Tomíčková, učo 114718

Full text of the thesis

5.6 MB / PDF file

Attachments (1)

cli.zip

Thesis language

English

Defense date

17. 6. 2026

The thesis was successfully defended

Supervisor

RNDr. David Procházka, učo 485104
KSUZD FI MU

Supervisor's review

Reviewer

Mgr. et Mgr. Jaroslav Oľha, Ph.D., učo 348646
VSDB ÚHE Teorie LF MU

Reviewer's report

Consultants

doc. RNDr. Vlastislav Dohnal, Ph.D., učo 2952
KSUZD FI MU

Mgr. Emma Sommerová, učo 514368
KSUZD FI MU

Cite this thesis

Citation according to the ČSN ISO 690 standard

JAKUBÍK, Michal. Continually Learned Index for Dynamic Approximate Nearest Neighbor Search. Online. Master's thesis. Brno: Masaryk University, Faculty of Informatics. 2026. Available from: https://is.muni.cz/th/o3x7i/.

@MastersThesis{Jakubik2026thesis,
  AUTHOR = {Jakubík, Michal},
  TITLE = {Continually Learned Index for Dynamic Approximate Nearest Neighbor Search},
  YEAR = {2026},
  TYPE = {Diplomová práce},
  INSTITUTION = {Masarykova univerzita, Fakulta informatiky},
  LOCATION = {Brno},
  SUPERVISOR = {David Procházka},
  URL = {https://is.muni.cz/th/o3x7i/},
  URL_DATE = {2026-06-23},
}

{{Citace kvalifikační práce | příjmení = Jakubík | jméno = Michal | instituce = Masarykova univerzita, Fakulta informatiky | odkaz na instituci = Fakulta informatiky Masarykovy univerzity | titul = Continually Learned Index for Dynamic Approximate Nearest Neighbor Search | url = https://is.muni.cz/th/o3x7i/ | typ práce = Diplomová práce | vedoucí = David Procházka | odkaz na vedoucího = {{UČO na článek|485104}} | místo = Brno | rok = 2026 | počet stran = | strany = | citace = 2026-06-23 | poznámka = | jazyk = en }}

Masaryk University, Faculty of Informatics

Study programme

Artificial Intelligence and Data Processing

Study plan

Machine Learning and Artificial Intelligence

Theses on a related topic

List of theses that share the same keywords.

Dynamic Learned Indexing for Vector Data Using Continual Learning
Ing. Filip Forgáč

Temporally Dynamic Learned Index for Similarity Search over Complex Data
Mgr. Emma Sommerová, učo 514368

Python Libraries for Storing Embeddings on Disk
Bc. Jakub Neruda, učo 524944

Comparison of Neural Embeddings for Tracking Cells
Mgr. Miroslav Mažgut

Learned Indexing in Vector Database Management Systems
Mgr. Jakub Žovák

Extrakce informací ze sportovních přenosů (Information Extraction from Sports Broadcasts)
Ing. Jakub Dvořák

Integration of an AI Copilot into Web Development Processes
Mgr. Dominik Adam

A Quest for Information: Enhancing Game-Based Learning with LLM-Driven NPCs
Mgr. Tereza Tódová

Name | Uploaded by | Uploaded on

Thesis archive Michal Jakubík FI N-UIZD SUUI o3x7i/10 | Kubová, L. | 27. 4. 2026

Files

Annotation in English, annotation_english.txt | Jakubík, M. | 17. 5. 2026
Annotation in Czech, annotation.txt | Jakubík, M. | 17. 5. 2026
Keywords, keywords.txt | Jakubík, M. | 17. 5. 2026
Full text of the thesis, thesis.pdf | Jakubík, M. | 18. 5. 2026
Reviewer's report, posudek_oponenta_Olha.pdf | Oľha, J. | 5. 6. 2026
Supervisor's review, posudek_vedouciho_Prochazka.pdf | Procházka, D. | 4. 6. 2026
Attachment, cli.zip | Jakubík, M. | 18. 5. 2026
Defense report, prubeh_obhajoby.pdf | Brázdil, T. | 17. 6. 2026
