HERMAN, Ondřej. Blooming Onion: Efficient Deduplication through Approximate Membership Testing. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022. Brno: Tribun EU, 2022, pp. 91-95. ISBN 978-80-263-1752-4.
Other formats:
BibTeX
LaTeX
RIS
@inproceedings{2240155,
  author = {Herman, Ondřej},
  address = {Brno},
  booktitle = {Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022},
  editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
  keywords = {deduplication; text corpora; Bloom filter},
  howpublished = {printed version "print"},
  language = {eng},
  location = {Brno},
  isbn = {978-80-263-1752-4},
  pages = {91--95},
  publisher = {Tribun EU},
  title = {Blooming Onion: Efficient Deduplication through Approximate Membership Testing},
  url = {https://nlp.fi.muni.cz/raslan/2022/paper16.pdf},
  year = {2022}
}
TY  - CONF
ID  - 2240155
AU  - Herman, Ondřej
PY  - 2022
TI  - Blooming Onion: Efficient Deduplication through Approximate Membership Testing
PB  - Tribun EU
CY  - Brno
SN  - 9788026317524
KW  - deduplication
KW  - text corpora
KW  - Bloom filter
UR  - https://nlp.fi.muni.cz/raslan/2022/paper16.pdf
N2  - Deduplication of source text is an important step in corpus building. Maximum corpus sizes have grown significantly, along with the computing resources required to process them. This article explores reducing the cost of deduplication by applying approximate membership testing using Bloom filtering.
ER  -
HERMAN, Ondřej. Blooming Onion: Efficient Deduplication through Approximate Membership Testing. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2022}. Brno: Tribun EU, 2022, pp.~91--95. ISBN~978-80-263-1752-4.