Nearest-neighbor Search from Large Datasets using Narrow Sketches

NAOYA, Higuchi, Imamura YASUNOBU, Vladimír MÍČ, Shinohara TAKESHI, Hirata KOUICHI and Kuboyama TETSUJI. Nearest-neighbor Search from Large Datasets using Narrow Sketches. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - ICPRAM. Portugal: SciTePress, 2022, p. 401-410. ISBN 978-989-758-549-4. Available from: https://dx.doi.org/10.5220/0010817600003122.


Basic information
Original name	Nearest-neighbor Search from Large Datasets using Narrow Sketches
Authors	NAOYA, Higuchi, Imamura YASUNOBU, Vladimír MÍČ (203 Czech Republic, guarantor, belonging to the institution), Shinohara TAKESHI, Hirata KOUICHI and Kuboyama TETSUJI.
Edition	Portugal, Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - ICPRAM, p. 401-410, 10 pp. 2022.
Publisher	SciTePress

Other information
Original language	English
Type of outcome	Proceedings paper
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Portugal
Confidentiality degree	is not subject to a state or trade secret
Publication form	printed version "print"
RIV identification code	RIV/00216224:14330/22:00125541
Organization unit	Faculty of Informatics
ISBN	978-989-758-549-4
Doi	http://dx.doi.org/10.5220/0010817600003122
UT WoS	000819122200044
Keywords in English	Narrow Sketch;Nearest-neighbor Search;Large Dataset;Sketch Enumeration;Partially Restored Distance
Tags	DISA, firank_B
Tags	International impact, Reviewed
Changed by	RNDr. Pavel Šmerk, Ph.D., učo (university ID) 3880. Changed: 28/3/2023 10:10.

Abstract

We consider nearest-neighbor search on large-scale, high-dimensional datasets that cannot fit in main memory. Sketches are bit strings that compactly represent data points. Although wide sketches are usually thought to be necessary for high-precision search, we use relatively narrow sketches, such as 22-bit or 24-bit ones, to select a small set of candidates for the search. As the criterion for candidate selection we use an asymmetric distance between data points and sketches instead of the traditionally used Hamming distance. This distance can be regarded as partially restoring the quantization error. To realize fast candidate selection, we use an efficient one-by-one enumeration of sketches in order of the partially restored distance. We demonstrate the effectiveness of the method on two datasets: YFCC100M-HNfc6, consisting of about 100 million 4,096-dimensional image descriptors, and DEEP1B, consisting of 1 billion 96-dimensional vectors. Using a standard desktop computer, we conducted nearest-neighbor searches for a query on datasets stored on an SSD, with vectors represented by 8-bit integers. The proposed method executes the search in 5.8 seconds on the 400 GB dataset YFCC100M and in 0.24 seconds on the 100 GB dataset DEEP1B, while maintaining a recall of 90%.
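
The candidate-selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not state: sketches are built by ball partitioning (bit i is 1 when the point lies within radius r_i of pivot p_i), and the asymmetric score between a query and a sketch is the sum of |d(q, p_i) - r_i| over the bits where the query's own sketch disagrees with the stored sketch. The function names are hypothetical, and the example sorts all points by score instead of enumerating sketches one by one as the paper does.

```python
import math

def make_sketch(x, pivots, radii):
    """Assumed ball partitioning: bit i is 1 when x is inside ball (p_i, r_i)."""
    bits = 0
    for i, (p, r) in enumerate(zip(pivots, radii)):
        if math.dist(x, p) <= r:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Symmetric baseline: number of differing bits between two sketches."""
    return bin(a ^ b).count("1")

def asymmetric_score(q, sketch, pivots, radii):
    """Assumed asymmetric score between the query point itself and a sketch.

    Each disagreeing bit contributes how far the query lies from that
    ball's boundary, so near-boundary disagreements cost little.
    """
    diff = make_sketch(q, pivots, radii) ^ sketch
    total = 0.0
    for i, (p, r) in enumerate(zip(pivots, radii)):
        if (diff >> i) & 1:
            total += abs(math.dist(q, p) - r)
    return total

def search(q, data, sketches, pivots, radii, k):
    """Keep the k best-scoring points, then compare exact distances among them."""
    order = sorted(
        range(len(data)),
        key=lambda j: asymmetric_score(q, sketches[j], pivots, radii),
    )
    candidates = order[:k]
    return min(candidates, key=lambda j: math.dist(q, data[j]))
```

Unlike the Hamming distance, this score uses the uncompressed query, which is one way to read "partially restoring quantization error". The exact distance is computed only for the k selected candidates, which matters most when the full vectors live on slow secondary storage.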

Links
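EF16_019/0000822, research and development project	Name: Centrum excelence pro kyberkriminalitu, kyberbezpečnost a ochranu kritických informačních infrastruktur (Centre of Excellence for Cybercrime, Cybersecurity and Protection of Critical Information Infrastructures)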
