Detailed Information on Publication Record
2023
CRANBERRY: Memory-Effective Search in 100M High-Dimensional CLIP Vectors
MÍČ, Vladimír, Jan SEDMIDUBSKÝ and Pavel ZEZULA
Basic information
Original name
CRANBERRY: Memory-Effective Search in 100M High-Dimensional CLIP Vectors
Authors
MÍČ, Vladimír (203 Czech Republic, guarantor), Jan SEDMIDUBSKÝ (203 Czech Republic, belonging to the institution) and Pavel ZEZULA (203 Czech Republic, belonging to the institution)
Edition
Cham, 16th International Conference on Similarity Search and Applications (SISAP), p. 300-308, 9 pp. 2023
Publisher
Springer
Other information
Language
English
Type of outcome
Proceedings paper
Field of Study
10200 1.2 Computer and information sciences
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
electronic version available online
Impact factor
0.402 in 2005
RIV identification code
RIV/00216224:14330/23:00131529
Organization unit
Faculty of Informatics
ISBN
978-3-031-46993-0
ISSN
Keywords in English
approximate similarity searching;high-dimensional data;indexing;filtering;LAION dataset
Tags
International impact, Reviewed
Changed: 5/3/2024 11:29, doc. RNDr. Jan Sedmidubský, Ph.D.
Abstract
In the original
Recent advances in cross-modal multimedia data analysis require efficient similarity search at the scale of hundreds of millions of high-dimensional vectors. We address this task by proposing the CRANBERRY algorithm, which combines and tunes several existing similarity search strategies. In particular, the algorithm: (1) employs Voronoi partitioning to obtain a query-relevant candidate set in constant time, (2) applies filtering techniques to prune the obtained candidates significantly, and (3) re-ranks the retained candidate vectors with respect to the query vector. Applied to a dataset of 100 million 768-dimensional vectors, the algorithm evaluates 10NN queries with 90% recall and an average query latency of 1.2 s, with a throughput of 15 queries per second on a server with a 56-core CPU and 4.7 queries per second on a PC.
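The three-step pipeline named in the abstract (partition, filter, re-rank) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the filtering technique, so the sketch swaps in a simple triangle-inequality lower bound on the distance to the cell pivot as an assumed stand-in. The function names (build_index, search), the parameters (n_cells, keep), and the use of Euclidean distance are likewise hypothetical choices made for illustration, and the toy data is random rather than CLIP vectors.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(data, pivots):
    """Assign each vector to the Voronoi cell of its nearest pivot.

    Returns cells, where cells[p] is a list of
    (object_id, distance_from_object_to_pivot_p) pairs.
    """
    cells = [[] for _ in pivots]
    for oid, vec in enumerate(data):
        d, p = min((dist(vec, piv), p) for p, piv in enumerate(pivots))
        cells[p].append((oid, d))
    return cells


def search(query, data, pivots, cells, k=10, n_cells=3, keep=50):
    """Approximate kNN: returns up to k (distance, object_id) pairs."""
    # (1) Candidate set: objects stored in the cells of the n_cells
    #     pivots closest to the query.
    q_to_piv = [dist(query, piv) for piv in pivots]
    nearest = heapq.nsmallest(
        n_cells, range(len(pivots)), key=q_to_piv.__getitem__
    )

    # (2) Filtering (assumed stand-in technique): |d(q,p) - d(o,p)| is a
    #     lower bound on d(q,o) by the triangle inequality. Keep only the
    #     `keep` candidates with the smallest bound; this uses stored
    #     distances only and never touches the full vectors.
    bounds = [
        (abs(q_to_piv[p] - d_op), oid)
        for p in nearest
        for oid, d_op in cells[p]
    ]
    survivors = heapq.nsmallest(keep, bounds)

    # (3) Re-ranking: exact distances for the surviving candidates only.
    exact = [(dist(query, data[oid]), oid) for _, oid in survivors]
    return heapq.nsmallest(k, exact)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(16)] for _ in range(1000)]
    pivots = rng.sample(data, 20)  # pivots drawn from the data (assumption)
    cells = build_index(data, pivots)
    query = [rng.random() for _ in range(16)]
    for d, oid in search(query, data, pivots, cells, k=5):
        print(oid, round(d, 4))
```

In this sketch, n_cells and keep trade recall for speed: visiting every cell and keeping every candidate reduces the search to an exact linear scan, while smaller values skip more exact distance computations at the risk of missing true neighbours.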