DML-CZ Metadata Editor
Content Creation System for Digital Libraries

Miroslav Bartošek, Petr Kovář and Martin Šárfy
Institute of Computer Science, Masaryk University, Brno, Czech Republic
bartosek@ics.muni.cz, kovar@ics.muni.cz, sarfy@ics.muni.cz

Abstract. The aim of the DML-CZ project (2005-2009; Czech Academy of Sciences, Masaryk University in Brno, Charles University in Prague, Czech Republic) is to investigate, develop and apply techniques, methods and tools that allow the creation of the Czech Digital Mathematics Library. The most important tool developed and used in the course of the project is the Metadata Editor, a complex web-based system supporting all essential steps in the development of an article-oriented digital library: integration of scanned pages (journals, proceedings, monographs) into hierarchical structures, article building, detailed metadata description down to the level of articles and book chapters, processing and linking of article bibliographic references, name authority management, inclusion of born-digital material, automated metadata verification, and generation of the resulting PDF papers. Rights management combined with remote access makes it possible to distribute the work on a digital library among many people with different levels of expertise. Building a library of more than 15,000 articles has proved the soundness of the Metadata Editor architecture and implementation. This paper gives a brief overview of the system architecture and functionality.

1 Introduction

In the Czech Digital Mathematics Library project¹ (which aims to digitize and present relevant Czech mathematical literature since the 19th century) we focused both on restoring the original look and feel of the historical materials and on delivering a content-rich database with features required by today's scientific world, such as extensive hypertext linking and a powerful search mechanism [1]. Many steps are necessary to achieve this goal [4].
An overall schema of the DML-CZ workflow is depicted in Figure 1. Scanned images of journals, proceedings and monographs are imported into a hierarchical structure and associated with the article reference metadata harvested from the reference databases Zentralblatt MATH or Mathematical Reviews (step 1).

1 Project 1ET200190513, funded by the Academy of Sciences of the Czech Republic, Programme "Information Society" (National Research Programme, 2005-2009).

Fig. 1. An overall picture of the workflow used in the DML-CZ project

Detailed descriptive article metadata is then reviewed, corrected and completed by mathematicians (step 2). Processing of the article bibliographic references consists of detecting the reference block in the OCR sources, splitting the block into individual references and structuring each reference. Mathematical reference databases are then queried to identify the corresponding records and to create hypertext links (step 3). The scanned images are enriched by OCR [5] and, together with a generated cover page, form the resulting article PDF file (step 4). These PDF files, as well as all descriptive, structural and administrative metadata, are then sent to our publication system based on the DSpace repository system (step 5), as described in [2]. Our workflow also supports the incorporation of born-digital documents (step 6). Publishers can use the tools we developed for collecting, managing and publishing digital libraries, greatly reducing their overhead costs.

To achieve the DML-CZ goals we needed a tool powerful enough to handle all the workflow processes. During the last three years, we gradually developed such a system. The Metadata Editor (ME) is a web-based application that allows users to manage digital library content creation effectively. In this article, we briefly describe the essential features of ME and give a technical overview of our solution.
2 Metadata Editor

The Metadata Editor (editor.dml.cz) is a client-server application consisting of a web interface, a suite of supporting scripts and an internal database. In the following we describe the Metadata Editor workflow used in our project. The main steps of the workflow are as follows:
— loading the input data into the ME internal structures;
— article building - defining the logical structure of digitized publications;
— metadata editing - creating descriptive metadata records from the journal/proceedings series/monograph level down to the article or book chapter level;
— bibliographic references processing - creating, harvesting and linking lists of references;
— automated metadata verification;
— final PDF compilation and export to the publication system.

2.1 Input data and its structure

The Metadata Editor works with data and metadata prepared in previous phases of the DML-CZ workflow from different sources, such as:
— digitized old printed documents (created in the scanning phase);
— materials already existing in some digital form (retrieved in the retro-born-digital conversion phase);
— born-digital publications inserted into ME on-line by publishers (newly published journal issues created automatically as a by-product of the publishing process).

The Metadata Editor focuses primarily on scanned documents, but it can handle other sources as well (usually by using simplified and modified workflows). All the data obtained from the scanning/conversion phases (page images, initial structural and page description metadata) are validated with respect to their completeness and consistency, correctness of page order, duplicates, etc. The data are then restructured, stored in a hierarchical directory structure suitable for further processing and enriched by metadata gained from OCR and mathematical reference databases.
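The validation of page completeness, ordering and duplicates mentioned above can be illustrated with a minimal Python sketch. It is not the project's actual import code: the function name `validate_page_numbers` and the assumption that pages of one issue carry contiguous integer numbers taken from the scan file names are ours, chosen only for illustration.

```python
from collections import Counter


def validate_page_numbers(page_numbers):
    """Report duplicates, gaps and ordering problems in a list of page numbers.

    Assumption (ours): pages of one issue are numbered by contiguous integers
    in the order in which the scans were delivered.
    """
    problems = []

    # Duplicates: the same page number delivered more than once.
    counts = Counter(page_numbers)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    if duplicates:
        problems.append(("duplicate", duplicates))

    # Completeness: every number between the first and last page must exist.
    if page_numbers:
        expected = set(range(min(page_numbers), max(page_numbers) + 1))
        missing = sorted(expected - set(page_numbers))
        if missing:
            problems.append(("missing", missing))

    # Ordering: a page number smaller than its predecessor is out of order.
    out_of_order = [b for a, b in zip(page_numbers, page_numbers[1:]) if b < a]
    if out_of_order:
        problems.append(("out_of_order", out_of_order))

    return problems


report = validate_page_numbers([1, 2, 2, 4, 3, 6])
for kind, pages in report:
    print(kind, pages)
```

An empty report means that the batch passes these three checks; anything else would be handed back to the scanning or conversion phase before the data are restructured.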
The Metadata Editor organizes objects in the following hierarchical structures:
— serials - journal/volume/issue/article,
— proceedings - proceedings series/proceedings volume/article,
— monographs - collection/monograph/chapter.

Each object in the Metadata Editor is managed using a unique identifier which reflects the path inside the directory structure where it is stored (the identifier also forms a part of the object's URL).

2.2 Article building

A combination of several methods is used to create the initial structure of the articles of a journal issue (proceedings volume) automatically, and thus to minimize the manual workload in the article building step. These methods exploit pagination information from the reference metadata and locate the beginnings and ends of articles in the OCR-ed texts. Tying the pages into the article structure automatically is not always reliable, and a manual check of the structure is still necessary. Sometimes pages are assigned to the wrong articles, or some articles are not detected at all. It is then necessary to move pages, to create new articles or to delete false ones. This problem applies to scanned documents only; born-digital articles are implicitly well structured.

The Metadata Editor provides effective ways to handle the article building task. The most interesting tool is the visual article editor: the human operator works with page thumbnails arranged on the screen in a table, like cards laid on a desk, as can be seen in Figure 2. This allows easy visual inspection of pages, verification of the page ordering, reshuffling of pages within an article and/or between articles, cancellation of badly identified articles and creation of missing ones, removal of blank pages, etc. Clicking on a thumbnail opens a large page image in a new window, allowing the operator to examine the details of a given page.
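The automated first pass described at the beginning of this section, which uses pagination information from the reference metadata, can be sketched as follows. This is a simplified illustration under our own assumptions: the function `build_articles`, the record keys `title`, `first_page` and `last_page`, and the rule that a page belongs to the first matching article are hypothetical and do not describe the actual ME implementation.

```python
def build_articles(pages, reference_records):
    """Assign scanned pages to articles using page ranges from reference metadata.

    pages: list of (page_key, printed_number) pairs; printed_number is None
           for pages without a printed page number (covers, blank pages, ...).
    reference_records: list of dicts with the hypothetical keys
           'title', 'first_page' and 'last_page'.
    Returns (articles, excluded): a dict mapping article titles to page keys,
    and a list of page keys that matched no article.
    """
    articles = {rec["title"]: [] for rec in reference_records}
    excluded = []
    for page_key, printed_number in pages:
        owner = None
        if printed_number is not None:
            for rec in reference_records:
                if rec["first_page"] <= printed_number <= rec["last_page"]:
                    owner = rec["title"]
                    break  # simplification: the first matching article wins
        if owner is None:
            excluded.append(page_key)
        else:
            articles[owner].append(page_key)
    return articles, excluded


pages = [(0, None), (1, 1), (2, 2), (3, 3), (4, 4), (5, None)]
records = [
    {"title": "Article A", "first_page": 1, "last_page": 2},
    {"title": "Article B", "first_page": 3, "last_page": 4},
]
articles, excluded = build_articles(pages, records)
print(articles)
print(excluded)
```

A proposal produced this way is only a starting point. Pages shared by two articles, or reference records with wrong page ranges, are exactly the cases that the operator has to resolve by hand.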
Page thumbnails are grouped into blocks of two types/colours: green blocks represent individual articles, red blocks consist of pages excluded from article processing (blank pages, front and back matter, advertisements, etc.). A set of auxiliary functions is available to handle the non-standard structuring of old printed journals (interleaved articles or page numbering schemas, articles crossing issue boundaries, etc.):
— page cloning (for pages belonging to more than one article),
— download/upload of page images (allowing local corrections/improvements of images),
— page number editing,
— page reshuffling within an article and/or journal issue,
— grouping of articles into named sections and subsections,
— and others.

Three different numbering schemas are used to identify pages in the Metadata Editor:
— physical page numbers - numbers printed on the original sheets of paper,
— logical page numbers - the unique identifiers of pages within an issue/proceedings/monograph,
— sequential page numbers - keys defining the order of pages within an issue/proceedings/monograph.

Fig. 2. Contents structure editing page

Logical numbers are always decimal numbers and are derived from the image file names assigned during the scanning phase. Sequential numbers keep the pages in the right order; they are not directly visible to the operator.

If the automated article building process fails significantly, creating the article structure manually may be time-consuming; in such cases a batch process of article building can be used to create all the articles in an issue at once, leaving the visual article editor for the final visual check only. Building articles (defining structural metadata) is permitted only to operators with appropriate structure editing rights.

2.3 Metadata editing

After the pages are grouped into articles and the document structure is created, the article metadata editing step is unlocked.
The metadata record is usually pre-filled with the reference metadata by the automated process. The operator is typically required only to check it and to make minor changes. The article metadata editing screen is split into two parts: the metadata editing form and the preview area. The left part of the screen consists of a form where the metadata are edited. The preview area displays the first page of the article, so the operator can easily consult it while editing the article metadata. It is possible to flip through the pages using the list of page links or keyboard shortcuts.

Fig. 3. Article metadata editing page

The following metadata are assigned to an article:
— Title - article title in several languages (at least the title in the original language and its translation into English are required),
— Author - author's name verified against the name authority database,
— Language - language of the article,
— MSC - MSC codes specifying the topics of the paper,
— Summary Language - language of the article summary,
— Article Type - type of article: math, physics, editorial, table-of-contents, history, ...
— Accessibility - can the article be made publicly visible?
— idMR, idZBL, idJFM - identifiers in the external databases MathSciNet, Zentralblatt MATH and Jahrbuch über die Fortschritte der Mathematik,
— Status - metadata record processing status: untouched, in progress, completed.

The operator can display the OCR text of a page in a separate window and use it for copy-and-paste editing. The Author field is connected to the name authority database. After typing just the first few letters of an author's name, the operator is offered a list of matching names, which makes it possible to choose the author's name without mistakes. The MSC field offers similar functionality: when a code is entered into the MSC field, the Metadata Editor checks whether such a code exists. Data in the idMR/idZBL/idJFM fields serve as links to the reference databases MathSciNet, Zentralblatt MATH and Jahrbuch über die Fortschritte der Mathematik. Clicking on one of these anchors opens the corresponding record of the given database in a new window, so that the correctness of the identifier assignment can be checked visually.

Link. The linking mechanism is used for binding related articles together. There are several different types of article relations: continuation articles, derived work, article review, suggested relevant papers, etc. Information about related articles can be presented to users in the publication system. Currently, we use this feature to link continuation articles only.

References.
Reference processing is designed to be semi-automated, with a human operator fixing errors from OCR processing and marking the basic reference structure. References are automatically identified in the OCR full text using methods similar to those in the CiteSeer project². Simple markup characters are then added by the system to the article OCR text (newline characters to separate individual references within the block of references, and '//' characters to mark the borders between authors and titles). In the next step, the result of the automatic reference pattern detection is reviewed and, if necessary, corrected by the operator. Using this markup, a structured record of reference metadata is then generated. Finally, mathematical reference databases are queried to identify the referenced articles and to establish hypertext links. Although the reference processing is highly automated, its resulting quality depends heavily on the quality of the OCR. We are investing considerable effort in finding a compromise between the resulting quality and the extent of manual intervention required from human operators.

Processing status. There are three different processing states assigned by the Metadata Editor to all objects in its database (articles, issues, volumes, journals, ...):
— untouched - (grey) the object and all its nested items have just been imported into the ME internal structures and have not been edited yet;
— in progress - (red) the object or at least one of its nested items has already been edited;
— completed - (green) the object and all its nested items have been completed and checked by an operator; the object is ready for the PDF generation step and for export to the publication system.

The status is displayed as a coloured bullet in front of the object title in the Metadata Editor.

2.4 Authority Base

The name authority base was introduced to handle ambiguities in author names correctly. The concept of the authority database is inspired by the one used in traditional library management systems.
An authority database record consists of the author's personal data (at least to the extent sufficient to distinguish between persons with the same name) and a set of name forms appearing in articles in the DML-CZ.

2 Based on regular expressions for typical textual reference patterns.

Fig. 4. Authority database management screen

In the article or reference metadata, we store a name in exactly the form in which it appears in the original printed document, together with an optional internal identifier of the corresponding authority record. This approach allows us to handle several scenarios:
— one person has several name forms (name with initials or full first names, pseudonym, transliterated forms of the name, etc.);
— two (or more) persons have the same name and we want to distinguish between them in order to attribute article authorship correctly;
— two (or more) persons have the same name, but we are uncertain which of them is the author of the article.
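The three scenarios above can be illustrated with a small data-model sketch in Python. It shows one possible way to combine printed name forms with optional authority identifiers; the class names, field names and sample persons are hypothetical and are not taken from the DML-CZ database schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AuthorityRecord:
    record_id: int
    description: str  # personal data sufficient to tell namesakes apart
    name_forms: set = field(default_factory=set)


@dataclass
class ArticleAuthor:
    printed_name: str                   # exactly as printed in the document
    authority_id: Optional[int] = None  # None: the person is not yet resolved


def find_articles(name_form, records, articles):
    """Return titles of articles by any person who uses the given name form."""
    ids = {r.record_id for r in records if name_form in r.name_forms}
    hits = []
    for title, authors in articles.items():
        for author in authors:
            resolved = author.authority_id in ids
            unresolved = (author.authority_id is None
                          and author.printed_name == name_form)
            if resolved or unresolved:
                hits.append(title)
                break
    return hits


# Hypothetical sample data: two different persons sharing the form "J. Novák".
records = [
    AuthorityRecord(1, "hypothetical person A", {"Novák, Jan", "J. Novák"}),
    AuthorityRecord(2, "hypothetical person B", {"J. Novák"}),
]
articles = {
    "Paper 1": [ArticleAuthor("Novák, Jan", 1)],
    "Paper 2": [ArticleAuthor("J. Novák", 1)],
    "Paper 3": [ArticleAuthor("J. Novák", 2)],
    "Paper 4": [ArticleAuthor("J. Novák")],  # authorship still uncertain
}
print(find_articles("Novák, Jan", records, articles))
print(find_articles("J. Novák", records, articles))
```

Keeping the printed form and the authority identifier separate is what lets the first query return both papers of person A regardless of the form printed in each, while the second query returns every candidate, including the paper whose author has not been resolved.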
The name authority database, which collects a broad spectrum of different name forms, is taken into account in the DML-CZ publication system³. Searching for a particular name form displays all articles written by the given author, regardless of the name forms used in the individual articles.

3 Or at least we are working on that.

This model can also be extended by detailed personal information (such as a curriculum vitae or a photograph) for more famous authors. Thanks to the name authority records, locating and correcting spelling errors in names is quite an easy task. The authority database keeps growing as the extent of the digital library grows, so keeping it clean is a never-ending process.

2.5 Searching and batch update mechanism

The browsing capabilities of the Metadata Editor are supplemented by an easy-to-use search module allowing operators to find specified objects quickly.

Fig. 5. Advanced search tool

The search mechanism can be used to search for a specified term in selected metadata record element(s), to search for articles by language, document category, and so on. It is possible to set one of the following relations between elements in a query:
— equal to - metadata records containing the search term somewhere in the selected elements,
— none equal to - none of the selected (possibly repeated) elements contains the search term,
— exact - the value of the selected element is exactly equal to the search term,
— empty - metadata records with all the selected elements empty,
— not empty - metadata records where at least one of the selected elements is not empty.

The search tool can also be used for automated batch metadata updates. When the search result set is displayed, privileged users can specify a metadata element and a value that should be added or updated in all records in the result set. This simple batch update tool addresses most of the needs of Metadata Editor operators without requiring intervention by a system administrator.

2.6 Automated metadata verification

To keep the data consistent and of high quality, we developed a powerful and extensible verification mechanism within the Metadata Editor, together with a set of useful tests. So far, the following verification tests have been implemented:
— test for missing mandatory metadata elements,
— data storage integrity tests (data completeness and coherence, XML validation, ...),
— test of page ordering based on OCR data,
— test of article language based on OCR language detection,
— syntax check of the TeX expressions used in metadata (titles),
— syntax check of the markup used for reference identification,
— statistics of the work progress (per individual document or per individual operator).

Each verification test consists of an executable plug-in and a formal description of its input parameters and output format.
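As an illustration of this plug-in structure, the following Python sketch pairs a formal description with a check for missing mandatory metadata elements. The description format, the element names and the function `run_missing_mandatory` are assumptions made for this example; the actual ME plug-ins are separate executables whose interface is not specified here.

```python
# Hypothetical formal description of one verification test.
MISSING_MANDATORY_TEST = {
    "name": "missing-mandatory-elements",
    "parameters": {
        "mandatory": "list of element names that must be non-empty",
    },
    "output": "list of (article_id, missing_element) pairs",
}


def run_missing_mandatory(records, mandatory):
    """Return (article_id, element) for every mandatory element that is empty.

    records: dict mapping article identifiers to dicts of metadata elements.
    """
    findings = []
    for article_id, metadata in records.items():
        for element in mandatory:
            value = metadata.get(element)
            if value is None or not str(value).strip():
                findings.append((article_id, element))
    return findings


# Hypothetical records; the element names are illustrative only.
records = {
    "article-1": {"title_original": "Sample title", "title_english": "",
                  "language": "cze"},
    "article-2": {"title_original": "Another title",
                  "title_english": "Another title", "language": "eng"},
}
findings = run_missing_mandatory(
    records, ["title_original", "title_english", "language"])
print(findings)
```

Keeping the description as plain data separate from the checking code is what would allow a generic user interface to be generated for every test without knowing its internals.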
The formal description is used to build an appropriate user interface for the particular verification test and to display the results interactively. It is also possible to restrict the verification to a subset of documents. Each test can be executed directly from the Metadata Editor or scheduled to run on a regular basis, notifying system administrators by e-mail. This feature is used for continuous (daily) monitoring of data storage integrity.

3 Technical Description

The Metadata Editor is a three-tier web application based on the Nitro (Ruby) framework. It uses the Model-View-Controller design pattern, which clearly separates the visual appearance from the application logic and the underlying data model (Figure 6). A regular filesystem with a very simple and self-explanatory schema is used as the primary data storage. Using the filesystem as an "API" turned out to be a great advantage during the integration of specialized external tools developed by independent programmers. Import/export scripts, the reference linking tool, similarity searching algorithms [3], OCR, backup, etc. are written in a real mixture of available programming technologies. A snapshot of the metadata indices is also stored in an internal MySQL database for efficient metadata browsing in the Metadata Editor. This approach combines the advantages of both storage techniques: flexible administration and quick access.

Fig. 6. System architecture of the Metadata Editor (Apache web server; Nitro (Ruby) web framework with controller, view (HTML+CSS templates) and model; filesystem data storage for TIFF, PDF, XML, ...; MySQL database holding the metadata index; external tools such as OCR and backup)

Thanks to its client-server architecture, the application user interface is accessible from anywhere using any web browser⁴. The network communication is encrypted using HTTPS, and the authorization model takes into account the role of the user with respect to the accessed data collection. All changes are logged, which, among other things, makes it possible to reward hired students fairly.
The server is backed up and monitored; in case of a heartbeat failure or an internal error, the system developer is informed by e-mail.

4 With a special focus on Mozilla Firefox, for which we provided keyboard shortcuts and some other special features.

4 Conclusions

During the building of a digital library consisting of more than 200,000 pages and 15,000 articles, the Metadata Editor has proven to be a mature tool able to manage all labour-intensive tasks effectively. The remote access architecture with a thin client and a robust authorization schema enables us to distribute the work among several people with different levels of expertise. It simplifies the management of work scheduling and allows us to create the library at minimal cost.

We also want to make some improvements to ME during the upcoming final year of the DML-CZ project. In particular, we need to complete the shift from a set of home-grown tools to a production-quality software suite, including an installation package and proper documentation. There is no doubt that this is a great challenge.

References

1. Bartošek, M., Lhoták, M., Rákosník, J., Sojka, P., Šárfy, M.: DML-CZ: The Objectives and the First Steps. In Borwein, J., Rocha, E.M., Rodrigues, J.F., eds.: CMDE 2006: Communicating Mathematics in the Digital Era. A. K. Peters, MA, USA (2008) 69-79.
2. Krejčíř, V.: Building the Czech Digital Mathematics Library upon DSpace System. (Submitted to the workshop "Towards Digital Mathematics Library 2008".)
3. Řehůřek, R., Sojka, P.: Classification of Multilingual Mathematical Papers in DML-CZ. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2007, Karlova Studánka, Czech Republic, Masaryk University, Brno (2007) 89-96.
4. Sojka, P.: From Scanned Image to Knowledge Sharing. In Tochtermann, K., Maurer, H., eds.: Proceedings of I-KNOW '05: Fifth International Conference on Knowledge Management, Graz, Austria, Know-Center in coop. with Graz Uni, Joanneum Research and Springer Pub. Co. (2005) 664-672.
5. Sojka, P., Panák, R., Mudrák, T.: Optical Character Recognition of Mathematical Texts in the DML-CZ Project. June 2006. Submitted to CMDE 2006.