Indexing and Searching Mathematics in Digital Libraries Architecture, Design and Scalability Issues Petr Sojka, Martin Líška Masaryk University, Faculty of Informatics, Botanická 68a, 602 00 Brno, Czech Republic sojka@fi.muni.cz, 255768@mail.muni. cz Abstract. This paper surveys approaches and systems for searching mathematical formulae in mathematical corpora and on the web. The design and architecture of our MlaS (Math Indexer and Searcher) system is presented, and our design decisions are discussed in detail. An approach based on Presentation MathML using a similarity of math subformulae is suggested and verified by implementing it as a math-aware search engine based on the state-of-the-art system, Apache Lucene. Scalability issues were checked based on 324,000 real scientific documents from arXiv archive with 112 million mathematical formulae. More than two billions MathML subformulae were indexed using our Solr-compatible Lucene extension. Keywords: math indexing and retrieval, mathematical digital libraries, information systems, information retrieval, mathematical content search, document ranking of mathematical papers, math text mining, MlaS, WebMIaS 1 Introduction I do not seek. I find. Pablo Picasso The solution to the problem of mathematical formulae retrieval lies at the heart of building digital mathematical libraries (DML). There have been numerous attempts to solve this problem, but none have found widespread adoption and satisfaction within the wider mathematics community. And as yet, there is no widely accepted agreement on the math search format to be used for mathematical formulae by library systems or by Google Scholar. MathML standard by W3C has become the standard for mathematics exchange between software tools. Almost no MathML is written directly by authors—they typically prefer a compact notation of some TpX flavour such as ETpX or ^j^-W^X.. The designer of a search system for mathematics is thus faced with the task of converting data to a unifying format, and allowing DML users to use their prefered notation when posing queries. [J^y^SJL^TrjX or omer t^X flavour are the typical preferences; Presentation MathML or Content MathML are used only when available as outputs of a software system. During the integration of existing DMLs into larger projects such as EuDML [15], the unsolved math search problem becomes evident—DML without math search support J.H. Davenport et al. (Eds.): Calculemus/MKM 2011, LNAI 6824, pp. 228-243, 2011. © Springer-Verlag Berlin Heidelberg 2011 Indexing and Searching Mathematics in Digital Libraries 229 is an oxymoron. As our subject matter search has not lead to a satisfactory solution, we have designed and implemented [7] new robust solutions for retrieval of mathematical formulae. Section 2 explores published facts about research done in the area of mathematics retrieval. Pros and cons of existing approaches are outlined, most of them being neither applicable nor satisfactory for digital library deployment. In Section 3 we present our design of scalable and extensible system for searching mathematics, taking into account not only inherent structure of mathematical formulae but also formula unification and subformulae similarity measures. Our evaluation of prototypical implementation above the Apache Lucene open source full-featured search engine library is presented in Section 4. The paper closes listing future work directions in Section 5 and a conclusion is summarised in Section 6. Computers are useless. They can only give you answers. Pablo Picasso 2 Approaches to Searching Mathematics A great deal of research on has been already undertaken on searching mathematical formulae in digital libraries and on the web. Several such Mathematical Search Engines (MSE) have been designed in the past: MathDex, EgoMath, ETEXSearch, LeActiveMath or MathWebSearch. In this section, we will briefly comment on each of these. MathDex1 (formerly MathFind [9]) is a result of a NSF-funded project headed by Robert Miner of Design Science2. It encodes mathematics as text tokens, and uses Apache Lucene as if searching for text. Using similarity with search terms, ranked results are produced by the search algorithm, matching n-grams of presentation MathML. The creators of MathDex report that most of the work was due to a necessary and extensive normalization of MathML—because of the fact that it uses several converters and filters to convert to XHTML + MathML—HTML (jtidy), TeX/ETeX (blahtex, ETeXML, Hermes), Word (Word+MathType), PDF (pdf2tiff+Infty). The algorithm of n-gram ranking has several drawbacks. For one thing, it cannot take many kinds of elementary mathematical equivalences into account, and it puts undue weight on variable names. Contrary to its intentions, MathDex has not become a sustainable service to the mathematical community, although it has fueled research in the area of mathematics searching [16,17,1]. EgoMath3 is being developed by Josef Misutka as an extension of a full text websearch core engine Egothor (by Leo Galambos, MFF UK Prague) [8] licenced under GPL. It uses presentation MathML for indexing and develops generalization algorithms and relevancy calculation to cope with normalization. As part of EgoThor evaluation, an MSE evaluation dataset is also being developed.4 1 www.mathdex.com/ 2 www.ima.umn.edu/2006-2007/SW12.8-9.06/activities/Miner-Robert/index.html 3 egomath.projekty.ms.mff.cuni.cz/egomath/ 4 egomath.cythres.cz/dataset.py 230 Petr Sojka, Martin Liška BTjjXSearch5 is a search tool offered by Springer in SpringerLink. It searches directly in the TpX math string representations as provided by the authors of papers submitted to Springer in fflgX sources. Some kind of text similarity matching is probably used. Since it is not open source, one can only guess the strategy for posing queries. Our experiments typically lead to a very low precision. Neither is there any definition of the article dataset available. LeActiveMath6 search has been developed as part of the ActiveMath-EU project. It is Lucene based, indexing string tokens from OMDoc with an OpenMath semantic notation. The document database format is internal since only documents authored for LeActiveMath learning environments are indexed. Math Web Search7 is an MSE developed in Bremen/Saarbriicken by Kohlhase et al. [2] It is not based on full text searching, rather it adopts a semantic approach: it uses substitution trees in memory. Both presentation and content MathML is supported, together with OpenMath. It is exceptional in the fact that it primarily deals with semantics and uses its own engine, not being built on the Lucene engine, for math. Further development is now being pursued under LaMaPun architecture [6]. The comparison of math search systems is summarized in Table 1. All of the MSEs reviewed had some drawbacks regarding their employment in a digital mathematical library such as EuDML. This was our main motivation for designing a new one, primarily for the use in large scale libraries, such as EuDML or ArXiv. Everything you can imagine is real. 3 Design Of MlaS Pablo Picasso We have developed a math-aware, full-text based search engine called MlaS (Math Indexer and Searcher). It processes documents containing mathematical notation in MathML format. MlaS allows users to search for mathematical formulae as well as the textual content of documents. Since mathematical expressions are highly structured and have no canonical form, our system pre-processes formulae in several steps to facilitate a greater possibility of matching two equal expressions with different notation and/or non-equal, but similar formulae. With an analogy to natural language searching, MlaS searches not only for whole sentences (whole formulae), but also for single words and phrases (subformulae down to single variables, symbols, constants, etc.). For calculating the relevance of the matched expressions to the user's query, MlaS uses a heuristic weighting of indexed terms, which accordingly affects scores of matched documents and thus the order of results. 3.1 System Workflow The top-level indexing scheme is shown in Figure 1 on page 232. A detailed view of the mathematical part is shown in Figure 2 on page 233. s www.latexsearch.com/ 6 devdemo.activemath.org/ActiveMath2/ 7 search.mathweb.org/index.xhtml Table 1: Comparison of math search systems System Input documents Internal representation Approach a-eq. Query language Queries Indexing core MathDex HTML, TeX/KTeX, Word, PDF Presentation MathML (as strings) syntactic X 7 text, math, mixed Apache Lucene LeActiveMath OMDoc, OpenMath OpenMath (as string) syntactic X OpenMath (palette editor) text, math, mixed Apache Lucene KTrÄSearch ETeX LTeX (as string) syntactic X LTeX titles, math, DOI 7 MathWeb Search Presentation MathML, Content MathML, OpenMath Content MathML, Open-Math (substitution trees) semantic ✓ QMath, BTeX, Mathematica, Maxima, Maple, Yacas styles (palette editor) text, math, mixed Apache Lucene (for text only) EgoMath Presentation MathML, Content MathML, PDF Presentation MathML trees (as strings) mixed ✓ lATpX text, math, mixed EgoThor MlaS any (well-formed) MathML Canonical Presentation MathML trees (as compacted strings) math tree similarity/ normalization ✓ ■^VtS-IATpX or MathML text, math, mixed Apache Lucene/Solr 232 Petr Sojka, Martin Liška 3.2 Indexing MlaS is currently able to index documents in XHTML, HTML and TXT formats. As Figure 1 shows, the input document is first split into textual and mathematical parts. The textual content is indexed in a conventional way. Mathematical expressions, on the other hand, are pre-analyzed in several steps to facilitate searches not only for exact whole formulae, but also for subparts (tokenization) and for similar expressions (formulae modifications). This addresses the issue of the static character of full-text search engines and creates several representations of each input formula all of which are indexed. Each indexed mathematical expression has a weight (relevancy score) assigned to it. It is computed throughout the whole indexing phase by individual processing steps following this basic rule of thumb—the more modified a formula and the lower the level of a subformula, the less weight is assigned to it. At the end of all processing methods, formulae are converted from XML nodes to a compacted linear string form, which can be handled by the indexing core. Start and end XML tags are substituted by the tag name followed by an argument embraced by opening and closing parentheses. This creates abbreviated but still unambiguous representation of each XML node. For example, formula a + b2, in MathML written as: Indexing and Searching Mathematics in Digital Libraries 233 a + b2 is converted to "math(mrow(mi(a)mo(+)msup(mi(b)mn(2))))" and this string is then indexed by Lucene. 3.3 Tokenization Tokenization is a straightforward process of obtaining subformulae from an input formula. MlaS makes use of Presentation MathML markup where all logical units are enclosed in XML tags which makes obtaining all subformulae a question of tree traversal. The inner representation of each formula is an XML node encapsulating all the member child nodes. This means the highest level formula—as it appears in the input document— is represented by a node named "math". All logical subparts of an input formula are obtained and passed on to modification algorithms. 234 Petr Sojka, Martin Liška 3.4 Formulae Modifications MlaS performs three types of unification algorithms, the goal of which is to create several more or less generalized representations of all formulae obtained through the tokenization process. These steps allow the system to return similar matches to the user query while preserving the formula structure and a-equality. 3.5 Ordering Let us take a simple example: a + 3 and the query 3+a. This would not match even though it is perfectly equal. This is why a simple ordering of the operands of the commutative operations, addition and multiplication, is used. It tries to order arguments of these operations in the alphabetical order of the XML nodes denoting the operands whenever possible—it considers the priority of other relevant operators in the formula. The system applies this function to the formula being indexed as well as to the query expression. Applied to the example above, the XML node denoting variable a is named "mi", the node denoting number 3 is named "mn". "mi"<"mn" therefore 3 + a would be exchanged for a + 3 and would match. 3.6 Unification of Variables Let us take another example: a + ba and x+yx. Again, these would not match even though the difference is only in the variables used. MlaS employs a process that unifies variables in expressions while taking bound variables into account. All variables are substituted for unified symbols (ids) in both the indexing and searching phases. Applied to the example, both expressions would unify to idi + id!^1 and would match. This process is not applied to single symbols—this would lead to the indexing of millions of ids and searching for any symbol would end up matching all of the documents containing it. 3.7 Unification of Constants This is a strightforward process of substituting all the numerical constants for one unified symbol (const). This obviates the need for the exact values of constants in user queries. In some situations however, this can be too much of a generalization. As well as in the case of the variables, stand-alone numerical constants are not unified for the same obvious reason. 3.8 Formulae Weighting During the searching phase, a query can match several terms in the index. However one match can be more important to the query than another, and the system must consider this information when scoring matched documents. For mathematical formulae the system makes use of the processing operations described above since they all produce expressions more generalized than the input ones. It is impossible to assemble a weighting function that is exactly right. Such a function should consider a document base on which the system will run as well as the established Indexing and Searching Mathematics in Digital Libraries 235 customs in a particular scientific field. We tried to create a complex and robust weighting function that would be appropriate to many fields. The original unchanged untokenized formula should of course have the greatest weight, but the precision of the ordered representation is not compromised at all, so it should have the same weight. In fact, if the ordering process changes the order of some members in an expression, the original formula is not indexed at all. The starting weight for such a representation is 1. The tokenization process should naturally lower the weight of the subformulae since they are deeper in the structure and therefore less important to the overall formula. When a user who is searching for a + b finds two documents, the first containing a + b and the second containing the first should score more and appear higher in the results, as it matches in higher level of MathML expression tree. Hence the tokenization process reduces the weight of the subformulae according to the level coefficient I < 1. Both unification algorithms produce representations that are more generalized than their input expressions. They have a higher probability of matching, and should therefore score less. The unification of variables alters the weight of the result formula by coefficient v < 1, unification of number constants uses coefficient c < 1. Theoretically, two equally unified subformulae matched on the same level of differently complex parent formulae would have the same score. For example a + b3+a and oo J25b2db o d — e --+-+ 100a/? 3 + a b with the query 3 + a. Both matches are not unified, and both are found on the third level. Analogously to conventional full-text engines which discriminate documents with more tokens than others, we use information about the complexity of parent formulae. More specifically, an initial weight of 1 is multiplied by the inverse number of nodes of a whole parent expression. According to this model, each formula has a weight attribute indexed alongside itself, which belongs to the interval (0,1). Weight w of the sub formula contained on a certain level in a parent formula with the number of nodes («) can be calculated in particular situations as follows: - no changes made: w = ^'(1+v+c+vc) - unified variables: w — -—(-v+vc) n - unified constants: w — -—('c+vc) n - unified both variables and constants: w — -—— . n See Section 3.9 for details. To fine tune the weighting parameters, we developed a tool with verbose output in which the behavior of the model can be observed and tested. A sample from the tool mentioned above is shown in Table 2 on page 237. 236 Petr Sojka, Martin Liška input: arranged: (a + b2*c, 0.125) (a + bc*2, 0.125) tokenization: variables unification: constants unification: p (a, 0.0875) //(+, 0.0875) (bc*2, 0.0875) lb, 0.06125 (e, 0.042875) / ', \ (2,0.042875) (+,0.042815) \ / [id^+id^^2, 0.1) (ídf'+2,0.07) (id , + 2,0.0343) (a + bc*cm", 0.0625) / (/>""""', 0.04375) (e + eonsř, 0.030625)', (idj+iď*'*™"", 0.05) (íd'/''+OT"', 0.035) (id^ + const, 0.01715) Fig. 3: Example of formula preprocessing. Ordered pairs are (). All expressions as shown are indexed, except for the original one. We have come to the conclusion that the unification of variables interferes less with original formula meaning than the unification of number constants. For this reason, its coefficient should be higher—i.e., less discriminating. The main question then became, how discriminating the level coefficient should be. Our empirical deduction is that going deeper in a structural tree should be discriminating, the precise match on a lower level should still score more than any unified formula on the level above, as could be seen in Table 2: ^ (row 5) is an exact match on the second level and its score is higher than unified expressions matched on the first level (rows 2, 3 and 4). This led us to the valuation of level weighting coefficient I — 0.7, unification weighting coefficient v = 0.8 and constant weighting coefficient c — 0.5. In Figure 3 the whole formula preprocessing process is illustrated together with its subformulae weightings. 3.9 Searching In the search phase, user input is again split into mathematical and textual parts. Formulae are then reprocessed in the same way as in the indexing phase, except for tokenization— which we doubt that users are likely to query, for example — wanting to find documents only with occurrences of variable c. That means the queried expressions are first ordered, then unified. This produces several representations which are connected to the final query by the logical OR operator. Textual query terms are connected to the final query by the logical AND operator. Therefore by specifying a text term we can narrow down the results, because each Indexing and Searching Mathematics in Digital Libraries 237 Table 2: Example of weighting function on several formulae. Original query is a + 3—all queried expressions are a + 3, idi + 3, a + const, idi + const. Formula Indexed Expressions Score Matched a + 3 0.25=[a + 3], 0.2=[idi + 3], 0.175=[a, 3, +], 0.125=[a + const], 0.1=[id! + const] 2.7 0.1 [idi + const] + 0.25[a + 3] + 0.2[id! + 3] + 0.125[a + const] b + 3 0.25=[& + 3], 0.2=[idi + 3], 0.175=[2>, +, 3], 0.125=[& + const], 0.1=[id! + const] 1.2 0.1 [id! + const] +0.2[id! +3] a + 5 0.25=[a + 5], 0.2=[id! + 5], 0.175=[a, +, 5], 0.125=[a + const], 0.1=[id! + const] 0.9 0.1[id! + const] + 0.125[a + const] c + 10 0.25=[c + 10], 0.2=[id! + 10], 0.175=[c,+, 10], 0.125=[c + const], 0.1=[id! + const] 0.4 0.1 [id! + const] l a+3 0.16667=[^3], Q.13334^-^], 0.11667=[l,a + 3], 0.09334=^! + 3], °-08334=[;^fi;]< 0.08167=[+, 3, a], 0.06667=[..CI"'st ,], 0.05833=[a + const], L idi +const-" L 0.04667=[idi + const] 1.26 0.04667[id! + const] + 0.11667[a + 3] + 0.09334[id! + 3] + 0.05833[a + const] 1 b+3 0.16667=[^],0.13334=[-r^], 0.11667=[2> + 3,1], 0.09334=[id! + 3], °-08334=[(^fi;]' 0.08167=[fc, 3, +], 0.06667=[..CI"'st ,], 0.05833=[fc + const], L idi +const-" L 0.04667=[idi + const] 0.56 0.04667[id! + const] + 0.09334^ + 3] 1 a+5 0.16667=[^], 0.13334=[-^], 0.11667=[l,a + 5], 0.09334=^! + 5], 0.08334=[^y, 0.08167=[a, 5, +], 0.06667=[a/°n"St ,], 0.05833=[a + const], L idi +const-" L 0.04667=[idi + const] 0.42 0.04667[id! + const] + 0.05833[a + const] 1 c+10 0.16667=^], 0.13334^-j^], 0.11667=[l,c + 10], 0.09334=[id! + 10], 0.08334=[-ssM, 0.08167=[+,c, 10], 0.06667=[..CI"'st ,], 0.05833=[c + const], L ldj +constJ' L J' 0.04667=[idi + const] 0.19 0.04667[id! + const] returned document must have the term contained. When more than one text term is specified, they are implicitly connected to the text query by the OR operator which means at least one term should occur in the result. We can also explicitly state preferences about each text term—whether it needs to occur in the result or not. As stated above, the final query, without having explicitly stated occurrences of text terms, is in the logical form of (formulai V ... V formulan) A (termi V ... V termn). In order to counterbalance the weight of the textual and mathematical parts of the query, the score of the matched formulae are additionally multiplied by number of nodes the matching query consists of. This results in more complex mathematical queries scoring more. 238 Petr Sojka, Martin Liška A very positive value has its price in negative terms... the genius of Einstein leads to Hiroshima. Pablo Picasso 4 Evaluation For large scale evaluation, we needed an experimental implementation and a corpus of mathematical texts. 4.1 Implementation The Math Indexer and Searcher is written in Java. The role of full-text indexing and searching core is performed by Apache Lucene 3.1.0. The mathematical part of document processing can be seen as a standalone pluggable extension to any full-text library, however it would need custom integration for each one. In the case of Lucene, a custom Tokenizer (MathTokenizer) has been implemented. For the textual content of documents, Lucene's StandardAnalyzer is employed. In MathTokenizer, TermAttributes are used for carrying strings of math expressions and PayloadAttribute for storing weights of formulae. The question now is, how should the weights of formulae be taken into consideration in the overall score of matched documents. Lucene's practical scoring function for every hit document d by query q with each query term t is as follows: score(q, d) — coord(q, d)-queryNorm(q)^y' (tf(t in d) ■ idf(t)2 ■ t.getBoostQ ■ norm(t,