$y$

Listing 7: Example of XHTML + MathML generated by the development version of MaxTract from a PDF document containing only the formula $x^2 + y^2$ in its body

During the MathDex project, it became clear that the most time- and resource-consuming task in building a math search engine and database is the normalization and conversion of heterogeneous sources [10]. As shown in Listings 1–6, MathML can vary slightly depending on how the code was obtained, even for a trivial formula like $x^2 + y^2$. In a DML project, the final MathML encoding can differ even for semantically and structurally similar formulae, because the MathML originates from different sources. Section 3 discusses several more complicated examples of possible ambiguities in MathML that have to be normalized to allow math searches and similarity computation.

3 Use Cases

Using our public working demo of the WebMIaS system, we discovered several discrepancies in the form of MathML generated by the real-time TeX to MathML converter we currently use, Tralics, and by the MathML canonicalizer from the UMCL library. We employed the UMCL canonicalization module to try to normalize the users' MathML input and the MathML produced by the LaTeXML converter contained in the arXMLiv collection. Then we went through the Presentation MathML specification and gathered a list of possible reformatting rules we could perform.

Normalization of Digital Mathematics Library Content 7

The goal is to reduce the possible MathML scripts with the same semantics and mathematical structure to just one representation. Such a canonicalized representation is convenient for many applications, as described in Sections 1 and 2.

Analyzing the issues of possible inconsistencies and ambiguities of MathML-encoded formulae raised design and strategy questions. Conceptual decisions for handling different types of similar constructions and completely different formulae need to be made.
More specifically, should we try to keep the MathML compact and reduce the number of nodes in transformations, or should we add nodes for better disambiguation? Another question is whether our future canonicalization tool should produce valid MathML according to the MathML schema. Unquestionably, this feature would be nice to have for many reasons and possible applications, but it certainly adds more requirements and takes much more effort to design and implement not only true/false validation, but also functional correctness validation.

Below we describe proposals for, and discussions of, transformations that can be performed with relatively minor difficulty. The list is not complete and is subject to further evaluation.

3.1 Removing Elements and Attributes

Many of the elements used in Presentation MathML make little or no contribution to the semantics of a formula, and therefore contribute little to indexing and searching. These are usually elements that alter the appearance of formulae in some way, typically space-like elements such as mspace, mpadded, mphantom, maligngroup, and malignmark. They may occasionally carry some semantic meaning, but we prefer to canonicalize similar formulae into one representation rather than risk treating the same formulae as different. Therefore, these elements are best omitted. The content of the mtext element should be indexed as normal text before removal.

Most element attributes are similarly undesirable. Many are used for formatting and affect only the appearance of rendered formulae (for example, the attributes linebreak and indentalign of the mo element). Others might have some slight semantic significance, but are very uncommon and usually not very important; we think these attributes should be removed. However, several exceptions exist. For instance, the element mfrac is used for fractions, but its meaning changes when the attribute linethickness is set to 0, which expresses a binomial coefficient.
The attributes of the element mfenced are also important (see Listing 9). The attribute mathvariant can also influence formula semantics and should therefore be preserved on all elements where it may appear. For example, the MIaS system uses this attribute so that hits matching the specified mathvariant font are ranked as more relevant.

8 David Formánek, Martin Líška, Michal Růžička, and Petr Sojka
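
The removal rules of Section 3.1 can be sketched in Python as follows. This is a minimal illustration using only the standard library, not the implementation of any existing canonicalizer. The function name normalize, the lookup tables, and the sample formula are hypothetical. The sketch also makes two assumptions that the text does not settle: mpadded is unwrapped so that its rendered children survive, while mphantom is dropped together with its invisible content; and the attributes kept on mfenced are open, close, and separators.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", MATHML_NS)  # serialize without an "ns0:" prefix

# Assumption: these elements are dropped together with their content.
DROP = {"mspace", "mphantom", "maligngroup", "malignmark"}
# Assumption: mpadded only adjusts spacing, so its children are kept.
UNWRAP = {"mpadded"}
# Attributes that Section 3.1 names as semantically relevant.
KEEP_EVERYWHERE = {"mathvariant"}
KEEP_ATTRS = {
    "mfrac": {"linethickness"},
    "mfenced": {"open", "close", "separators"},
}


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def normalize(elem, texts):
    """Return a cleaned copy of elem; mtext contents are appended to texts."""
    name = local(elem.tag)
    keep = KEEP_ATTRS.get(name, set()) | KEEP_EVERYWHERE
    out = ET.Element(elem.tag, {k: v for k, v in elem.attrib.items() if k in keep})
    out.text = (elem.text or "").strip() or None
    for child in elem:
        cname = local(child.tag)
        if cname in DROP:
            continue
        if cname == "mtext":
            # Hand the text over for ordinary full-text indexing, then drop it.
            texts.append("".join(child.itertext()).strip())
            continue
        cleaned = normalize(child, texts)
        if cname in UNWRAP:
            out.extend(list(cleaned))  # keep the children, lose the wrapper
        else:
            out.append(cleaned)
    return out


# Hypothetical input: a binomial coefficient surrounded by layout markup.
SAMPLE = """<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <mrow>
    <mfrac linethickness="0" bevelled="false">
      <mi mathvariant="bold">n</mi><mi>k</mi>
    </mfrac>
    <mspace width="1em"/>
    <mtext>for all</mtext>
    <mpadded width="+2em"><mi mathcolor="red">k</mi></mpadded>
    <mphantom><mo>+</mo></mphantom>
  </mrow>
</math>"""

if __name__ == "__main__":
    collected = []
    result = normalize(ET.fromstring(SAMPLE), collected)
    print(ET.tostring(result, encoding="unicode"))
    print(collected)
```

In this sketch the attribute filter is a whitelist rather than a blacklist, so any attribute not explicitly listed is removed. That matches the stated preference for merging similar formulae into one representation, at the cost of discarding the rare attribute that carries meaning.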