Detailed Information on Publication Record
2022
Blooming Onion: Efficient Deduplication through Approximate Membership Testing
HERMAN, Ondřej
Basic information
Original name
Blooming Onion: Efficient Deduplication through Approximate Membership Testing
Authors
HERMAN, Ondřej (203 Czech Republic, guarantor, belonging to the institution)
Edition
Brno, Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022, p. 91-95, 5 pp. 2022
Publisher
Tribun EU
Other information
Language
English
Type of outcome
Article in proceedings
Field of Study
10200 1.2 Computer and information sciences
Country of publisher
Czech Republic
Confidentiality degree
is not subject to a state or trade secret
Publication form
printed version "print"
References:
RIV identification code
RIV/00216224:14330/22:00127485
Organization unit
Faculty of Informatics
ISBN
978-80-263-1752-4
ISSN
Keywords in English
deduplication; text corpora; Bloom filter
Changed: 15/5/2024 09:54, RNDr. Pavel Šmerk, Ph.D.
Abstract
In the original
Deduplication of source text is an important step in corpus building. Maximum corpus sizes have grown significantly, along with the computing resources required to process them. This article explores reducing the cost of deduplication by applying approximate membership testing using Bloom filtering.
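As a rough illustration of the idea named in the abstract, the following is a minimal sketch of line-level deduplication with a Bloom filter. It is not the implementation described in the paper: the class `BloomFilter`, the function `deduplicate`, the default `capacity` and `error_rate` values, and the choice of BLAKE2b with double hashing are all assumptions made for this example. The sizing uses the standard textbook formulas for the number of bits and hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical minimal Bloom filter (illustrative, not the paper's code)."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was possibly already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False  # at least one bit was unset, so the item is new
                self.bits[byte] |= mask
        return seen


def deduplicate(lines, capacity=1_000_000, error_rate=0.001):
    """Yield each line the first time it appears.

    A false positive in the filter drops a line that was never seen,
    so the result is approximate by design.
    """
    seen = BloomFilter(capacity, error_rate)
    for line in lines:
        if not seen.add(line):
            yield line
```

The trade-off in this sketch is memory against accuracy: the filter stores a fixed number of bits regardless of how long the lines are, at the price of occasionally discarding a unique line as a presumed duplicate.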
Links
LM2018101, research and development project