Normalization of Digital Mathematics Library Content

MathML Canonicalization

David Formánek, Martin Líška, Michal Růžička, and Petr Sojka

Masaryk University, Faculty of Informatics
Botanická 68a, 602 00 Brno, Czech Republic
david.formanek@mail.muni.cz, martin.liski@mail.muni.cz, mruzicka@mail.muni.cz, sojka@fi.muni.cz

Abstract. This paper discusses the need for data normalization in a Digital Mathematics Library (DML). Specifically, emphasis is given to canonicalizing formulae encoded in Presentation MathML notation, which is becoming available in several DMLs and is used by DML applications. Canonicalization is a prerequisite for advanced processing, namely math-enabled fulltext searching, semantic filtering, and automated classification. Different sources of MathML and their specifics are described. Several use cases of possible formula canonicalization transformations are listed and discussed in detail. Finally, the findings are summarized and the design of a canonicalization tool to be developed is outlined.

Keywords: MathML normalization, canonicalization, digital mathematics libraries, DML, presentation MathML

1 Motivation

Modern Digital Mathematics Libraries (DML) such as EuDML [18,5] base their services on paper semantics, i.e. fulltext handling, including mathematical formulae, as well as basic metadata and Mathematics Subject Classification (MSC) codes. Mathematics literature is widely dispersed across a large number of publishers, making it very difficult to collect fulltexts from these heterogeneous sources. This situation is very different from other libraries, such as PubMed Central for biomedical and life sciences, where publishers have an agreed workflow using the NLM Journal Publishing Tag Set and tools developed with funding from the National Institutes of Health.

Full paper texts have to be ‘homogenized’, that is, converted to some uniform representation, in order for math-aware full-text searches [15] and paper similarity computations [11,12] to work properly.
These tasks are usually handled using a bag-of-words representation of the document text (the vector space model), in which every term (word or lemma) has its own dimension and the number of occurrences of the term gives its value. Non-textual terms such as mathematical formulae are mostly not taken into account. This creates another challenge for DMLs, as mathematical formulae are the essence of mathematical publications. There is an average of 380 mathematical formulae per arXiv paper in the MREC database [8]. It has been reported [21] that even a single histogram of mathematical symbols is sufficient for domain classification of a paper in the mathematical domain.

To reliably represent a paper for DML processing, including handling the mathematics, it is necessary to

1. select a canonical representation of the non-textual structural entities appearing in fulltexts (mathematical symbols, formulae, and equations); and
2. decide on equivalence classes for these entities (e.g., which formulae should be considered equal for given DML tasks such as search, similarity computation, formulae editing, and conversion of math into Braille).

In this paper, we discuss the options for selecting the canonical representations of formulae to be used in DML tools, and the canonicalization process, i.e. the process of computing this canonical representation from a variety of different sources and formats.

Our primary motivation is the natural requirement that our own (Web)MIaS system, which currently uses Presentation MathML [14], operate correctly and offer the expected search behaviour to users regardless of the MathML input source. When a user posts a query to the system, the system must abstract from the underlying notational differences in order to behave correctly. This requirement becomes more pressing as the number of different sources of MathML grows.
Currently there are three sources (LaTeXML, Tralics, and user input), and the number is expected to increase. If they are not correctly normalized, the system misbehaves and appears to users as if it simply does not work, however good the underlying design is.

We have used the UMCL library [1,2] for canonicalization in our MIaS system so far. However, we have found the deficiencies of the software so severe (change of formulae semantics, slowness, ...) [7, chapter 5], and the need for canonicalization so important, that we have decided to design and implement a new canonicalization tool from scratch.

This paper is structured as follows: in Section 2, different sources of mathematics are described and their differences are discussed. The core part of this paper is Section 3, where several use cases of possible canonical representation and canonicalization are documented and suggested. We conclude with Section 5, and present a plan for future work.

2 MathML Sources

To store mathematical formulae in our documents we have chosen MathML¹, an XML-based language, as a widely used, formally defined, but still evolving standard. The widespread use of MathML and its XML base means that this language is supported by various tools in the whole document workflow. More importantly, MathML can be used as a common language among the advanced computer mathematical software packages that are extensively used by working mathematicians.

¹ More precisely, Presentation MathML, as there are currently significantly more real-life resources using this form of MathML than Content MathML.

On the author end of the document workflow, the MathML code can be ‘hand made’ using simple plain text editors such as MS Windows Notepad, or something more comfortable, such as the specialized XML editors that are usually part of various integrated development environments.
For example, the formula $x^2 + y^2$ can be written as follows:

Listing 1: Example of the ‘hand made’ formula $x^2 + y^2$

However, the XML nature of MathML makes the code of more complex formulae too long to construct comfortably by hand. Software tools are therefore the more frequent sources of MathML. MathML can be generated as an output or data exchange format by complex specialized programs, such as Maple, Matlab, and Mathematica [9,20,22], or by web services, such as the well-known Wolfram Alpha [23], that are extensively used by mathematicians to support their work.

    generate::MathML(x^2 + y^2, Content = FALSE, Annotation = FALSE)

Listing 2: Example of MathML export of the formula $x^2 + y^2$ by the Matlab 7.9.0 MuPAD symbolic engine

Listing 3: Example of the MathML export of the Wolfram Alpha input query ‘x^2 + y^2’

On the consumer end of the document workflow, MathML can be used as an input for mathematical programs and services (Maple, Matlab, Mathematica, Wolfram Alpha, etc.) or simply displayed, usually as part of an XHTML web page, in a web browser with MathML support.

However, a large number of mathematical documents are produced using the TeX typesetting system and authored in TeX markup. Thus, it is necessary to be able to convert the TeX source code of mathematical formulae to the MathML language.

Our main motivation is the WebMIaS system. For more complex input formulae, it would be uncomfortable for the user to construct queries in MathML manually, as the code would be very complicated. The well-known LaTeX syntax is far more appropriate for manual input. Therefore, we need a conversion from LaTeX to MathML as part of the WebMIaS input routine.

There are several tools that are able to convert TeX markup to the MathML language. For example, arXMLiv [16] employs LaTeXML [19]. The EuDML project and our WebMIaS [8] system internally use Tralics [6].
Listing 4: Example of LaTeXML generated MathML of formula $x^2 + y^2$

Listing 5: Example of Tralics generated MathML of formula $x^2 + y^2$

A frequent type of mathematical document in a DML is an older paper that is unavailable in any digital format, or is available only in an ‘end’ format such as PDF, which is suitable for reading and printing but not for direct MathML processing. These documents can be a significant part of the DML content collection, so they are worth further processing.

Documents available in hard copy only can be scanned and processed using the InftyReader [17] optical character recognition (OCR) software. InftyReader has a unique feature for detecting mathematical formulae in a scanned document. These formulae can be subsequently saved as MathML.

Listing 6: Example of InftyReader generated MathML from a PDF document containing only the formula $x^2 + y^2$ in its body

Born-digital PDF documents with no available source code can be processed using the MaxTract software [3,4], which is under intensive development as part of the EuDML project. MaxTract generates a LaTeX source or an XHTML+MathML representation of the document based on an optical analysis of the positions of characters on the page. The analysis is supported with information from the fonts embedded in the processed document.
Listing 7: Example of XHTML + MathML generated by the development version of MaxTract from a PDF document containing only the formula $x^2 + y^2$ in its body

During the MathDex project, it became clear that the most time- and resource-consuming task in building a math search engine and database is the normalization and conversion of heterogeneous sources [10].

As shown in Listings 1–6, MathML can vary slightly depending on how the code was obtained, even for a trivial formula like $x^2 + y^2$. In a DML project, the final MathML encoding can differ even for semantically and structurally similar formulae, because the MathML originates from different sources. In Section 3, we discuss several more complicated examples of possible ambiguities in MathML that have to be normalized to allow math searches and similarity computation.

3 Use Cases

Using our public working demo of the WebMIaS system, we discovered several discrepancies in the form of the MathML generated by the real-time TeX to MathML converter we currently use (Tralics) and by the MathML canonicalizer from the UMCL library. We employed the UMCL canonicalization module to try to normalize the users’ MathML input and the MathML produced by the LaTeXML converter contained in the arXMLiv collection. Then we went through the Presentation MathML specification and gathered a list of possible reformatting rules we could perform.

The goal is to reduce the possible MathML encodings with the same semantics and mathematical structure to just one representation. Such a canonicalized representation is convenient for many applications, as described in Sections 1 and 2.

Analyzing the possible inconsistencies and ambiguities of MathML-encoded formulae raised design and strategy questions. Conceptual decisions need to be made on how to handle different types of similar constructions and completely different formulae.
More specifically, for example, should we try to keep the MathML compact and reduce the number of nodes in transformations, or should we try to add nodes for better disambiguation? Another question is: should our future canonicalization tool produce valid MathML according to the MathML schema? Unquestionably, this feature would be nice to have for many reasons and possible applications, but it certainly adds more requirements and takes much more effort to design and implement not only true/false validation, but also functional correctness validation.

The proposals and discussions below cover transformations that can be performed with relatively minor difficulty. The list is not complete and is subject to further evaluation.

3.1 Removing Elements and Attributes

Many of the elements used in Presentation MathML make little or no contribution to the semantics of the formula, and therefore contribute little to indexing and searching either. These are usually elements that alter the appearance of formulae in some way, for example space-like elements such as mspace, mpadded, mphantom, maligngroup, and malignmark. They may occasionally have some semantic meaning, but we prefer to canonicalize similar formulae into one representation rather than risk treating the same formulae as different. Therefore, these elements are best omitted. The content of the mtext element should be indexed as normal text before removal.

Most element attributes are similarly undesirable. Many are used for formatting, affecting only the appearance of rendered formulae (for example, the attributes linebreak and indentalign of the mo element). Others might have some slight semantic significance, but are very uncommon and usually not very important; we think these attributes should be removed. However, several exceptions exist. For instance, the element mfrac is used for fractions, but its meaning changes when the attribute linethickness is set to 0, which expresses a binomial coefficient.
The attributes of the element mfenced are also important (see Listing 9). The attribute mathvariant can also influence formula semantics and should therefore be preserved on all elements where it can occur. For example, the MIaS system makes use of this attribute so that hits matching the font specified by mathvariant are ranked as more relevant.
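
The removal rules of this subsection can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not the design of the planned tool: the names `canonicalize`, `SPACE_LIKE`, `KEEP_EVERYWHERE`, and `KEEP_PER_ELEMENT` are hypothetical, the choice of `open`, `close`, and `separators` as the mfenced attributes to keep is our assumption, and space-like elements are dropped together with their subtrees, which may be too aggressive for mpadded.

```python
import xml.etree.ElementTree as ET

# Space-like elements named in the text; dropping the whole subtree is an
# assumption of this sketch (unwrapping mpadded may be preferable).
SPACE_LIKE = {"mspace", "mpadded", "mphantom", "maligngroup", "malignmark"}
# Attributes preserved on every element.
KEEP_EVERYWHERE = {"mathvariant"}
# Attributes preserved only on specific elements (mfenced list is assumed).
KEEP_PER_ELEMENT = {
    "mfrac": {"linethickness"},
    "mfenced": {"open", "close", "separators"},
}


def local_name(tag):
    """Return the element name without a '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def canonicalize(root):
    """Strip presentational markup in place; return collected mtext strings."""
    texts = []
    _strip(root, texts)
    return texts


def _strip(element, texts):
    name = local_name(element.tag)
    allowed = KEEP_EVERYWHERE | KEEP_PER_ELEMENT.get(name, set())
    for attr in list(element.attrib):
        if attr not in allowed:
            del element.attrib[attr]
    for child in list(element):
        child_name = local_name(child.tag)
        if child_name == "mtext":
            # Keep the text for ordinary full-text indexing, then remove it.
            texts.append("".join(child.itertext()).strip())
            element.remove(child)
        elif child_name in SPACE_LIKE:
            element.remove(child)
        else:
            _strip(child, texts)


SAMPLE = (
    '<math display="block"><mrow>'
    '<mi mathvariant="bold" mathcolor="red">x</mi>'
    '<mspace width="1em"/>'
    '<mo linebreak="auto">+</mo>'
    '<mfrac linethickness="0"><mi>n</mi><mi>k</mi></mfrac>'
    '<mtext>for all n</mtext>'
    '</mrow></math>'
)

root = ET.fromstring(SAMPLE)
texts = canonicalize(root)
print(ET.tostring(root, encoding="unicode"))
print(texts)
```

In the sample, mspace and mtext disappear, the mtext content is returned for text indexing, and only mathvariant and the linethickness of mfrac survive among the attributes. An allow-list of attributes was chosen over a deny-list because the text treats removal as the default and preservation as the exception.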