ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text

D 2016

RYGL, Jan, Petr SOJKA, Michal RŮŽIČKA and Radim ŘEHŮŘEK

Basic information

Original name

ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text

Authors

RYGL, Jan (203 Czech Republic), Petr SOJKA (203 Czech Republic, guarantor, belonging to the institution), Michal RŮŽIČKA (203 Czech Republic, belonging to the institution) and Radim ŘEHŮŘEK (203 Czech Republic)

Edition

Brno, Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016, p. 79-87, 9 pp. 2016

Publisher

Tribun EU

Other information

Language

English

Type of outcome

Proceedings paper

Field of Study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality degree

is not subject to a state or trade secret

Publication form

printed version "print"

References:

Workshop home page, preprint

RIV identification code

RIV/00216224:14330/16:00087632

Organization unit

Faculty of Informatics

ISBN

978-80-263-1095-2

ISSN

UT WoS

000466886400009

Keywords (in Czech)

ScaleText; modelování vektorovým prostorem; latentní sémantické indexování; LSI; strojové učení; škálovatelné vyhledávání; návrh vyhledávače; dolování textu

Keywords in English

ScaleText; vector space modelling; Latent Semantic Indexing; LSI; machine learning; scalable search; search system design; text mining

Abstract

In the original

This paper describes the design of ScaleText, a new system aimed at scalable semantic indexing of heterogeneous textual corpora. We discuss the design decisions that led to a modular system architecture for indexing and searching using semantic vectors of document segments – nuggets of wisdom. The prototype implementation is evaluated by applying Latent Semantic Indexing (LSI) to the Enron corpus. The Bpref measure is used to automate the comparison of different algorithms and system configurations.
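
The abstract names Bpref as the measure used to compare algorithms and configurations automatically. As a rough illustration only, the sketch below computes Bpref for a single query in the form used by the trec_eval tool; this record does not say which variant the paper uses, and the function name, argument names and document identifiers are all hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """Bpref for one query (trec_eval-style variant, assumed here).

    ranking     -- retrieved document ids, best first
    relevant    -- set of ids judged relevant
    nonrelevant -- set of ids judged non-relevant
    Documents without a judgment are ignored.
    """
    num_rel = len(relevant)
    if num_rel == 0:
        return 0.0
    # Penalties are normalised by min(R, N), as trec_eval does.
    bound = min(num_rel, len(nonrelevant))
    nonrel_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if bound == 0:
                total += 1.0
            else:
                # Penalise by the judged non-relevant documents ranked above.
                total += 1.0 - min(nonrel_seen, bound) / bound
    return total / num_rel


# Hypothetical judgments and one ranked result list.
relevant = {"d1", "d2", "d3", "d4"}
nonrelevant = {"n1", "n2"}
ranking = ["d1", "n1", "d2", "x", "n2", "d3"]
print(bpref(ranking, relevant, nonrelevant))  # 0.375
```

Because only judged documents enter the score, a measure of this kind stays usable when relevance judgments are incomplete; averaging it over queries gives one number per system configuration.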

Links

MUNI/A/0892/2015, internal MU code

Name: Výzkum v aplikované informatice na FI MU (Research in Applied Informatics at FI MU; Acronym: VAIFIMU)

Investor: Masaryk University, Category A

TD03000295, research and development project

Name: Inteligentní software pro sémantické hledání dokumentů (Intelligent Software for Semantic Document Search; Acronym: ISSHD)

Investor: Technology Agency of the Czech Republic

Cite

RYGL, Jan, Petr SOJKA, Michal RŮŽIČKA and Radim ŘEHŮŘEK. ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text. In Aleš Horák, Pavel Rychlý, Adam Rambousek. Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. Brno: Tribun EU, 2016, p. 79-87. ISBN 978-80-263-1095-2.

@inproceedings{1361540,
   author = {Rygl, Jan and Sojka, Petr and Růžička, Michal and Řehůřek, Radim},
   address = {Brno},
   booktitle = {Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016},
   editor = {Aleš Horák and Pavel Rychlý and Adam Rambousek},
   keywords = {ScaleText; vector space modelling; Latent Semantic Indexing; LSI; machine learning; scalable search; search system design; text mining},
   howpublished = {printed version "print"},
   language = {eng},
   location = {Brno},
   isbn = {978-80-263-1095-2},
   pages = {79--87},
   publisher = {Tribun EU},
   title = {ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text},
   url = {http://raslan2016.nlp-consulting.net/},
   year = {2016}
}

TY  - CONF
ID  - 1361540
AU  - Rygl, Jan
AU  - Sojka, Petr
AU  - Růžička, Michal
AU  - Řehůřek, Radim
PY  - 2016
TI  - ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text
PB  - Tribun EU
CY  - Brno
SN  - 9788026310952
KW  - ScaleText
KW  - vector space modelling
KW  - Latent Semantic Indexing
KW  - LSI
KW  - machine learning
KW  - scalable search
KW  - search system design
KW  - text mining
UR  - http://raslan2016.nlp-consulting.net/
L2  - http://www.fi.muni.cz/usr/sojka/papers/rygl-sojka-ruzicka-rehurek-raslan2016.pdf
N2  - This paper describes the design of ScaleText, a new system aimed at scalable semantic indexing of heterogeneous textual corpora. We discuss the design decisions that led to a modular system architecture for indexing and searching using semantic vectors of document segments – nuggets of wisdom. The prototype implementation is evaluated by applying Latent Semantic Indexing (LSI) to the Enron corpus. The Bpref measure is used to automate the comparison of different algorithms and system configurations.
ER  -

RYGL, Jan, Petr SOJKA, Michal RŮŽIČKA and Radim ŘEHŮŘEK. ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches : Digging for Nuggets of Wisdom in Text. In Aleš Horák, Pavel Rychlý, Adam Rambousek. \textit{Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016}. Brno: Tribun EU, 2016, p.~79-87. ISBN~978-80-263-1095-2.
