Fast Similarity Searching of Text Documents using Learned Metric Index

Žovák, Jakub

Bachelor's thesis


Annotation

Text documents such as blogs, social media statuses, news articles, essays, and text messages are one of the main sources of information on the internet. It is therefore extremely important to index and search such data efficiently. However, because text objects are large and complex, finding an exact match is practically impossible. These objects must therefore be searched on the basis of similarity …

Abstract

Text documents such as blog posts, tweets, news articles, essays, and text messages represent one of the primary sources of information on the internet. It is therefore paramount to index and search such data efficiently. However, since these objects are large and complex, searching for an exact match is practically impossible. Text objects must therefore be searched based on the notion of similarity …

Keywords

similarity searching, learned indexes, learned metric index, machine learning, NLP, text similarity

Thesis assignment

Searching in texts is still an open challenge. One viable approach to fast and practical text browsing is similarity searching: we can define a similarity function that determines the similarity between each pair of words, sentences, or even whole documents. In 2018, a paper called The Case for Learned Index Structures was published, arguing for a new paradigm for organizing and searching within complex data using machine learning. The goal of this thesis is to apply such an approach to the problem of similarity searching in text data and evaluate the results. First, the student will have to get familiar with a great variety of approaches to text similarity, both lexical and semantic. Second, the student will process these approaches for machine learning. Next, the text data will need to be indexed using an existing framework called Learned Metric Index (LMI). Since the framework has never been used with this type of data, it will be necessary to identify the distinctive characteristics of text data and modify the setup of LMI to appropriately represent similarity within the text datasets. Finally, the searching efficiency of the resulting index will be evaluated experimentally.
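As a rough illustration of the similarity function the assignment mentions, the following sketch compares documents lexically with Jaccard similarity over word sets and ranks a small collection by brute force. The function names, the tokenizer, and the sample documents are assumptions made for illustration only; this is neither the method used in the thesis nor the LMI API. An index such as LMI is meant to avoid exactly this kind of exhaustive scan.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Lexical similarity: shared tokens divided by all distinct tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # assumption: two empty documents count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def most_similar(query: str, documents: list[str], k: int = 1) -> list[tuple[float, str]]:
    """Brute-force k-nearest-neighbour search: score every document, keep the top k."""
    scored = [(jaccard_similarity(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical sample collection, not data from the thesis.
documents = [
    "learned index structures use machine learning",
    "similarity searching in text documents",
    "a recipe for potato soup",
]
print(most_similar("searching text by similarity", documents, k=1))
```

Jaccard similarity is only one lexical measure; a semantic approach would typically compare vector embeddings of the documents instead, which this sketch does not attempt.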

Administrative information

Thesis checked:
24. 5. 2022 09:06, RNDr. Matej Antol, Ph.D., učo 325040

Assigned/changed: 30. 6. 2022 10:07, Miroslava Tomíčková, učo 114718
Record created: 3. 5. 2022 10:06, Jana Zemanová, učo 9619
Publish from: 19. 5. 2022 12:25, Lucie Wagnerová, učo 119715
Thesis accepted: 19. 5. 2022 12:25, Lucie Wagnerová, učo 119715

Full text of the thesis

2.4 MB / PDF file

Attachments (1)

bachelor-thesis-resources.rar

Thesis language

English

Defense date

29. 6. 2022

The thesis was successfully defended.

Supervisor

RNDr. Matej Antol, Ph.D., učo 325040
CERIT SC ÚVT MU

Supervisor's review

Opponent

RNDr. Miriama Jánošová, učo 424615
KSUZD FI MU

Opponent's review

Consultant

RNDr. Terézia Slanináková, Ph.D., učo 445526
KSUZD FI MU

Cite this thesis

Citation according to the ČSN ISO 690 standard

ŽOVÁK, Jakub. Fast Similarity Searching of Text Documents using Learned Metric Index. Online. Bachelor's thesis. Brno: Masaryk University, Faculty of Informatics. 2022. Available from: https://is.muni.cz/th/wmtet/.

@misc{Zovak2022thesis, AUTHOR = {Žovák, Jakub}, TITLE = {Fast Similarity Searching of Text Documents using Learned Metric Index}, YEAR = {2022}, TYPE = {Bakalářská práce}, INSTITUTION = {Masarykova univerzita, Fakulta informatiky}, LOCATION = {Brno}, SUPERVISOR = {Matej Antol}, URL = {https://is.muni.cz/th/wmtet/}, URL_DATE = {2026-07-16}, }

{{Citace kvalifikační práce | příjmení = Žovák | jméno = Jakub | instituce = Masarykova univerzita, Fakulta informatiky | odkaz na instituci = Fakulta informatiky Masarykovy univerzity | titul = Fast Similarity Searching of Text Documents using Learned Metric Index | url = https://is.muni.cz/th/wmtet/ | typ práce = Bakalářská práce | vedoucí = Matej Antol | odkaz na vedoucího = {{UČO na článek|325040}} | místo = Brno | rok = 2022 | počet stran = | strany = | citace = 2026-07-16 | poznámka = | jazyk = en }}

Masaryk University, Faculty of Informatics

Study programme

Informatics

Study plan

Informatics

Theses on a related topic

List of theses that share the same keywords.

Implementace Learned Metric Index (Implementation of Learned Metric Index)
RNDr. Terézia Slanináková, Ph.D., učo 445526

Implementation of Unsupervised Learned Metric Index
Mgr. Vojtěch Kaňa

Indexing Data Using Machine Learning
Mgr. Jakub Hanko

Enhancing Performance of Learned Metric Index for Indexing Large Datasets
Bc. Jozef Čerňanský

Application of machine learning to searching in unstructured data
RNDr. Terézia Slanináková, Ph.D., učo 445526

Propaganda Detection using Stylometric Text Analysis
RNDr. Radoslav Sabol, učo 469331

Propojení pojmenovaných entit získaných z českých biomedicínských textů se standardními slovníky (Linking named entities extracted from Czech biomedical texts with standard dictionaries)
Mgr. Filip Gregora

Analysis of use of AI systems in writing final theses at FI MU
Ing. David Černý


Name | Uploaded by | Uploaded on
Thesis archive: Jakub Žovák FI B-INF IN wmtet/7 | Zemanová, J. | 3. 5. 2022

Files

English annotation: annotation_english.txt | Žovák, J. | 16. 5. 2022
Czech annotation: annotation.txt | Žovák, J. | 18. 5. 2022
Keywords: keywords.txt | Žovák, J. | 17. 5. 2022
Full text of the thesis: Fast_Similarity_Searching_of_Text_Documents_using_Learned_Metric_Index.pdf | Žovák, J. | 18. 5. 2022
Opponent's review: posudek_oponenta_Janosova.pdf | Jánošová, M. | 16. 6. 2022
Supervisor's review: posudek_vedouciho_Antol.pdf | Antol, M. | 17. 6. 2022
Attachment: bachelor-thesis-resources.rar | Žovák, J. | 18. 5. 2022
