Math Information Retrieval for Digital Libraries

RŮŽIČKA, Michal

Basic information

Original title

Math Information Retrieval for Digital Libraries

Authors

RŮŽIČKA, Michal (203 Czech Republic, guarantor, belonging to the institution)

Edition

Brno, 144 pp. Doctoral Thesis, 2017

Publisher

Masaryk University, Faculty of Informatics

Other information

Language

English

Type of outcome

Special-purpose publication

Field of study

10201 Computer sciences, information science, bioinformatics

Country of publisher

Czech Republic

Confidentiality

is not subject to a state or trade secret

Links

Full-text Archive of Thesis

RIV identification code

RIV/00216224:14330/17:00094408

Organization unit

Faculty of Informatics

Keywords in English

math information retrieval; math-aware full-text search; similarity search; digital libraries; MIaS; WebMIaS; MathML canonicalization; MathML structural unification; query relaxation strategies; topic modeling; math representation; evaluation

Tags

International impact

Changed: 10 January 2018 14:12, RNDr. Michal Růžička, Ph.D.

Abstract

In the original language

The aim of my thesis is to improve full-text search functionality in digital libraries of scientific documents. Documents in STEM (science, technology, engineering and mathematics) fields usually contain many mathematical formulae that are germane to the main message of the documents. However, common full-text search engines currently provide neither the functionality to index formulae nor appropriate interfaces for end users to search them. Our math-aware full-text search engine MIaS (Math Indexer and Searcher) builds on the search functionality of state-of-the-art keyword-based full-text search engines and adds a new and efficient way of indexing formulae. To be indexed, formulae are processed in several steps: they are normalized to remove syntactic differences between semantically identical formulae; subformulae and generalized forms of the formulae are derived so that they are represented (and searchable) as standalone entities in the index; weights are assigned to these derived forms according to how far they have been modified from the original formula; finally, the formulae are converted to a compact text representation called MTerms, which common text search engines can index directly. During search, similar processing is applied to the formulae in the query, yielding query MTerms that are used to query the index. Formula weights in the index are used to rank the results. For queries combining multiple text keywords and mathematical formulae, I developed and evaluated several query relaxation strategies that improve the quality of the results. I also evaluated the suitability of various math representations for topic modeling on math documents, with a view to further improving the ranking of results in the math-aware search engine.
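
The indexing steps described above (normalization, derivation of subformulae and generalized forms, weighting, and conversion to an indexable text form) can be illustrated with a toy sketch. This is not the MIaS implementation: the `Node` tree, the prefix-string encoding standing in for MTerms, the rule of sorting commutative operands, and the penalty values `depth_penalty` and `general_penalty` are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical set of commutative operators used by the toy normalizer.
COMMUTATIVE = {"+", "*"}


@dataclass(frozen=True)
class Node:
    """A toy formula tree: an operator with children, or a leaf symbol."""
    label: str
    children: tuple = ()


def linearize(node):
    """Serialize a tree to a compact prefix string (a stand-in for an MTerm)."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(linearize(c) for c in node.children) + ")"


def normalize(node):
    """Sort operands of commutative operators so that a+b and b+a coincide."""
    kids = tuple(normalize(c) for c in node.children)
    if node.label in COMMUTATIVE:
        kids = tuple(sorted(kids, key=linearize))
    return Node(node.label, kids)


def generalize(node):
    """Replace every alphabetic leaf (a variable) with the placeholder 'id'."""
    if not node.children:
        return Node("id") if node.label.isalpha() else node
    return Node(node.label, tuple(generalize(c) for c in node.children))


def index_terms(formula, depth_penalty=0.7, general_penalty=0.5):
    """Map each derived term to a weight that shrinks with every modification."""
    terms = {}

    def visit(node, weight):
        # Emit the subformula itself and its generalized form.
        for tree, w in ((node, weight), (generalize(node), weight * general_penalty)):
            key = linearize(tree)
            terms[key] = max(terms.get(key, 0.0), w)
        # Subformulae one level deeper receive a lower weight.
        for child in node.children:
            visit(child, weight * depth_penalty)

    visit(normalize(formula), 1.0)
    return terms


a_plus_b = Node("+", (Node("a"), Node("b")))
b_plus_a = Node("+", (Node("b"), Node("a")))
print(index_terms(a_plus_b))
```

In this sketch the two syntactically different inputs `a_plus_b` and `b_plus_a` yield the same set of weighted terms, and the unmodified formula keeps the highest weight, so an exact match would rank above a match on a subformula or a generalized form.
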
Using the math normalization tool and the query relaxation strategies in concert, MIaS was independently evaluated as the system providing the best results at the NTCIR-11 evaluation event, ahead of seven other math-aware full-text systems. MIaS has since been integrated into the European Digital Mathematics Library (EuDML), where it provides math-aware search over hundreds of thousands of math papers, making EuDML the first industry-standard digital library with this functionality. This thesis evaluates the effects of various configuration parameter setups for the above-mentioned techniques of representing and indexing math-specific content (formulae). The evaluation process is automated using datasets and ground truth data from the NTCIR evaluation events. The automated evaluation methodology described in the thesis, which exploits existing judgement data, can also be applied to the rigorous evaluation of other systems.
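
Automated evaluation against existing judgement data can be sketched as follows. The abstract does not state which measures the thesis uses, so precision at k is an assumption here, as are the data layout (`run` maps a query ID to a ranked list of document IDs, `qrels` maps a query ID to graded judgements), the `min_grade` threshold, and the choice to count unjudged documents as non-relevant.

```python
def precision_at_k(ranked_ids, judged, k, min_grade=1):
    """Fraction of the top-k results judged relevant (unjudged counts as 0)."""
    top = ranked_ids[:k]
    hits = sum(1 for doc_id in top if judged.get(doc_id, 0) >= min_grade)
    return hits / k


def evaluate(run, qrels, k=5):
    """Return the mean precision at k and the per-query scores for one run."""
    scores = {
        query_id: precision_at_k(ranked, qrels.get(query_id, {}), k)
        for query_id, ranked in run.items()
    }
    mean = sum(scores.values()) / len(scores) if scores else 0.0
    return mean, scores


# Hypothetical system output and judgement data for two queries.
run = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1": 2, "d3": 1}, "q2": {"d8": 0}}

mean_p2, per_query = evaluate(run, qrels, k=2)
print(mean_p2, per_query)
```

Because the judgements are fixed, the same function can score any number of configuration variants of a system, or runs from other systems, without new manual assessment.
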

Related projects

TD03000295, research and development project

Name: Inteligentní software pro sémantické hledání dokumentů (Intelligent Software for Semantic Document Search; Acronym: ISSHD)

Investor: Technology Agency of the Czech Republic, Inteligentní software pro sémantické hledání dokumentů

1ET200190513, research and development project

Name: DML-CZ: Česká digitální matematická knihovna (Czech Digital Mathematics Library)

Investor: Academy of Sciences of the Czech Republic, DML-CZ: Česká digitální matematická knihovna

250503, MU internal code

Name: The European Digital Mathematics Library (Acronym: EuDML)

Investor: European Union, The European Digital Mathematics Library

Cite

RŮŽIČKA, Michal. Math Information Retrieval for Digital Libraries. Brno: Masaryk University, Faculty of Informatics, 2017, 144 pp. Doctoral Thesis.

@misc{1401941,
   author = {Růžička, Michal},
   address = {Brno},
   keywords = {math information retrieval; math-aware full-text search; similarity search; digital libraries; MIaS; WebMIaS; MathML canonicalization; MathML structural unification; query relaxation strategies; topic modeling; math representation; evaluation},
   language = {eng},
   location = {Brno},
   publisher = {Masaryk University, Faculty of Informatics},
   title = {Math Information Retrieval for Digital Libraries},
   url = {https://is.muni.cz/th/143424/fi_d/doctoral-thesis.pdf},
   year = {2017}
}

TY  - GEN
ID  - 1401941
AU  - Růžička, Michal
PY  - 2017
TI  - Math Information Retrieval for Digital Libraries
VL  - Doctoral Thesis
PB  - Masaryk University, Faculty of Informatics
CY  - Brno
KW  - math information retrieval
KW  - math-aware full-text search
KW  - similarity search
KW  - digital libraries
KW  - MIaS
KW  - WebMIaS
KW  - MathML canonicalization
KW  - MathML structural unification
KW  - query relaxation strategies
KW  - topic modeling
KW  - math representation
KW  - evaluation
UR  - https://is.muni.cz/th/143424/fi_d/doctoral-thesis.pdf
L2  - https://is.muni.cz/th/143424/fi_d/?lang=en
N2  - The aim of my thesis is to improve full-text search functionality in digital libraries of scientific documents. Documents in STEM (science, technology, engineering and mathematics) fields usually contain many mathematical formulae that are germane to the main message of the documents. However, common full-text search engines currently provide neither the functionality to index formulae nor appropriate interfaces for end users to search them. Our math-aware full-text search engine MIaS (Math Indexer and Searcher) builds on the search functionality of state-of-the-art keyword-based full-text search engines and adds a new and efficient way of indexing formulae. To be indexed, formulae are processed in several steps: they are normalized to remove syntactic differences between semantically identical formulae; subformulae and generalized forms of the formulae are derived so that they are represented (and searchable) as standalone entities in the index; weights are assigned to these derived forms according to how far they have been modified from the original formula; finally, the formulae are converted to a compact text representation called MTerms, which common text search engines can index directly. During search, similar processing is applied to the formulae in the query, yielding query MTerms that are used to query the index. Formula weights in the index are used to rank the results. For queries combining multiple text keywords and mathematical formulae, I developed and evaluated several query relaxation strategies that improve the quality of the results. I also evaluated the suitability of various math representations for topic modeling on math documents, with a view to further improving the ranking of results in the math-aware search engine. Using the math normalization tool and the query relaxation strategies in concert, MIaS was independently evaluated as the system providing the best results at the NTCIR-11 evaluation event, ahead of seven other math-aware full-text systems. MIaS has since been integrated into the European Digital Mathematics Library (EuDML), where it provides math-aware search over hundreds of thousands of math papers, making EuDML the first industry-standard digital library with this functionality. This thesis evaluates the effects of various configuration parameter setups for the above-mentioned techniques of representing and indexing math-specific content (formulae). The evaluation process is automated using datasets and ground truth data from the NTCIR evaluation events. The automated evaluation methodology described in the thesis, which exploits existing judgement data, can also be applied to the rigorous evaluation of other systems.
ER  -

RŮŽIČKA, Michal. \textit{Math Information Retrieval for Digital Libraries}. Brno: Masaryk University, Faculty of Informatics, 2017, 144 pp. Doctoral Thesis.
