Vector Space Representations in Information Retrieval

NOVOTNÝ, Vít. Vector Space Representations in Information Retrieval. Brno: Fakulta Informatiky Masarykovy Univerzity, 2017, 56 pp.

Basic information
Original name	Vector Space Representations in Information Retrieval
Name in Czech	Vektorové reprezentace ve vyhledávání znalostí
Authors	NOVOTNÝ, Vít (203 Czech Republic, guarantor, belonging to the institution).
Edition	Brno, 56 pp. 2017.
Publisher	Fakulta Informatiky Masarykovy Univerzity

Other information
Original language	English
Type of outcome	Special-purpose publication
Field of Study	10201 Computer sciences, information science, bioinformatics
Country of publisher	Czech Republic
Confidentiality degree	is not subject to a state or trade secret
WWW	Full text; Thesis archive; Files related to the thesis
RIV identification code	RIV/00216224:14330/17:00094402
Organization unit	Faculty of Informatics
Keywords in English	document segmentation; synonymy; question answering; vector space model; text retrieval; information retrieval
Tags	acl, gensim, scaletext
Tags	International impact
Changed by	RNDr. Vít Starý Novotný, Ph.D., učo 409729. Changed: 1/11/2021 09:37.

Abstract

Modern text retrieval systems segment documents during indexing. I show that, rather than returning the individual segments to the user, combining all segments from a single document into one result with an aggregate similarity score yields significant improvements on the semantic text similarity task. Standard text retrieval methods underestimate the semantic similarity between documents that use synonymous terms. Latent semantic indexing tackles this problem by clustering frequently co-occurring terms, at the cost of periodic reindexing of dynamic document collections and of relying on co-occurrences, which are a suboptimal measure of synonymy. I develop a term similarity model that suffers from neither of these flaws.
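
The segment aggregation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which aggregate function the thesis uses, so the function name `aggregate_segment_scores`, the `(doc_id, score)` input shape, and the choice of maximum or mean as the aggregate are all assumptions made for illustration only.

```python
from collections import defaultdict


def aggregate_segment_scores(segment_hits, aggregate=max):
    """Collapse per-segment hits into one ranked result per document.

    segment_hits: iterable of (doc_id, score) pairs, one per retrieved segment.
    aggregate:    function mapping a list of segment scores to a single score.
                  (Hypothetical choice; the abstract does not name one.)
    """
    scores_by_doc = defaultdict(list)
    for doc_id, score in segment_hits:
        scores_by_doc[doc_id].append(score)

    # One entry per document, carrying the aggregate similarity score.
    ranked = [(doc_id, aggregate(scores)) for doc_id, scores in scores_by_doc.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up segment-level similarities for two documents.
hits = [("doc_a", 0.42), ("doc_b", 0.80), ("doc_a", 0.91), ("doc_b", 0.10)]

ranked_by_max = aggregate_segment_scores(hits)
ranked_by_mean = aggregate_segment_scores(hits, aggregate=lambda s: sum(s) / len(s))
```

With these made-up scores, both aggregates rank `doc_a` above `doc_b`, and each document appears once in the result list instead of once per segment.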

Abstract (translated from Czech)

Modern text retrieval systems segment documents while building the document database. In this thesis, I present a procedure that combines all segments of a single document at retrieval time and derives from them the similarity of the document to the user's query. Common text retrieval methods underestimate the similarity of documents that use different terminology. Latent semantic analysis addresses this problem by clustering words that occur together. The price of this solution, however, is the need to rebuild the document database for dynamically changing collections, and the inadequacy of word co-occurrences as a measure of mutual word similarity. I present a model that suffers from neither of these shortcomings.

Links
TD03000295, research and development project	Name: Inteligentní software pro sémantické hledání dokumentů (Acronym: ISSHD)
TD03000295, research and development project	Investor: Technology Agency of the Czech Republic
