DEBWrite: Free Customizable Web-based Dictionary
Writing System
Adam Rambousek, Aleš Horák
Natural Language Processing Centre
Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic
{rambousek,hales}@fi.muni.cz
Abstract
Today, lexicographers can avail themselves of several commercial and freely distributed dictionary
writing systems (DWS). Nevertheless, there is still a group of users whose requirements are not
satisfied by existing DWSs. In various lexicographic forums, there is a growing demand for a freely
available DWS that allows customization of the dictionary microstructure. In accordance with such
requests, a new project was developed as part of the DEB (Dictionary Editor and Browser) platform.
DEBWrite is implemented as a multi-platform web application based on open standards. It allows
users to create and share a new dictionary without any difficult configuration or advanced technical
skills. According to a defined entry structure, the editing form and the public dictionary browser
are generated automatically. DEBWrite supports cooperation within both small and larger teams working
on the dictionary content. Access rights management for the created dictionary involves three levels
of user roles: a manager, an editor, and a reader. It is possible to publish the resulting dictionary in
various formats, both for human readers, and for external applications (e.g. NLP-related applications
that need to work with lexicographic data). The dictionary may be published in an online form, or
in formats suitable for print preparation.
Keywords: dictionary writing system; lexicographic platform; dictionary authoring; DEB platform
1. Introduction
There are several software tools available for dictionary creation and publication, both commercial
(e.g. IDM DPS (IDM DPS, 2006) or TLex (Joffe and de Schryver, 2004)), and freely available
(e.g. Mātāpuna (Moskovitz, 2004)). During the development of the DEB (Dictionary Editor and
Browser) lexicographic platform (Horák and Rambousek, 2007; Horák et al., 2008), we have designed
and implemented many lexicographic projects with complex entry structure or management. On
the other hand, we have also seen demand for dictionary writing software suited to
small dictionaries with their own entry structure, usually prepared by a small lexicographic team with limited
resources for their project. For such teams, existing free tools are too limiting, and commercial
tools are too expensive. Several such dictionaries were created using the DEB platform tools, for
example the Terminological Dictionary of Fine Arts by the Faculty of Fine Arts, Brno University
of Technology (Horák and Rambousek, 2007), or the Czech-English Dictionary of Ethnological
Terminology by the National Institute of Folk Culture¹. To fulfil the requirements of such a
range of dictionaries, a new application of the DEB platform, called DEBWrite, was developed.
2. The DEB platform
Utilizing the experience from several preceding lexicographic projects, we have designed and implemented
a universal dictionary writing system that can be exploited in various lexicographic
applications to build distributed lexical databases. The system is called Dictionary Editor and
Browser, or the DEB platform (Horák and Rambousek, 2007, 2010). Since 2005, the DEB platform
has been applied in more than 10 large international research projects. Large-scale applications based on
the DEB platform include the lexicographic workstation for the development of the Czech Lexical
Database (Horák and Rambousek, 2013) with detailed morpho-syntactic information on more than
213,000 Czech words, or the complex lexical database Cornetto combining the Dutch wordnet, an
ontology, and an elaborate lexicon (Horák et al., 2008). Currently ongoing projects include the Pattern
Dictionary of English Verbs, tightly interlinked with the corpus evidence (Maarouf et al., 2014),
Family Names in Britain and Ireland (Hanks et al., 2011), providing detailed investigations for over
45,000 surnames to be published by Oxford University Press, and the dictionary of the Czech Sign
Language², with an extensive use of video recordings to present the signs (Rambousek and Horák,
2015).
The DEB platform is based on the client-server architecture, which brings a number of benefits. All
the dictionary and interlinked data are stored on a server and a considerable part of the functionality
is also implemented on the server side; consequently, the client application can be very lightweight.
This approach provides very good tools for editor team cooperation; data modifications are immediately
seen by all involved users. The DEB server also provides authentication and authorization
tools.
The server part is built from small, reusable parts, called servlets, which allow a modular composition
of all services. Each servlet provides different functionality such as database access, dictionary search,
morphological analysis or a connection to corpora. The overall design of the DEB platform focuses
on modularity. A DEB server can store its data in any kind of structured database (or consult
several databases and join them into one compact dictionary storage), and it can prepare and combine
complex answers to user queries without the need to use specific query languages for each
data source. The main data storage is currently provided by the Sedna XML database (Fomichev
et al., 2006), which is an open-source native XML database providing XPath and XQuery access
to a set of document containers. Several DEB applications also work with connections to standard
relational databases, such as PostgreSQL or MySQL, or to specialized data providers, such as the
geographical information system GRASS or a morphological analyser.
¹ http://www.nulk.cz
² http://www.dictio.info
The user interface, which forms the most important part of a client application, usually consists of a
set of flexible complex forms that dynamically cooperate with the server parts. Client applications
can be implemented in any programming language that can interact with the DEB server
using the available server interfaces.
Client applications communicate with servlets using standard HTTP requests in a manner similar
to a popular concept in web development called AJAX (Asynchronous JavaScript and XML) or
using the SOAP protocol³. The data are transported over HTTP in a variety of formats – RDF,
XML documents, JSON-encoded data⁴, plain-text formats, or marshalled using SOAP.
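As a rough illustration of the request style described above, the following Python sketch composes an HTTP GET URL for a servlet call and decodes a JSON reply. The server address, the servlet name `dictionary`, the parameter names `action` and `query`, and the shape of the reply are all assumptions made for this example; they are not taken from the DEB server interfaces.

```python
import json
from urllib.parse import urlencode

# Hypothetical server address, for illustration only.
BASE_URL = "https://deb.example.org"


def build_query_url(servlet: str, **params: str) -> str:
    """Compose an HTTP GET URL for a servlet call (parameter names are assumed)."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{servlet}?{query}"


def parse_reply(body: str) -> list:
    """Extract headwords from a JSON reply of the assumed shape."""
    data = json.loads(body)
    return [entry["headword"] for entry in data["entries"]]


# A canned reply stands in for the network round trip.
sample_reply = '{"entries": [{"id": "e1", "headword": "fresco"}]}'
url = build_query_url("dictionary", action="search", query="fresco")
print(url)
print(parse_reply(sample_reply))
```

A real client would send the URL with an HTTP library and handle authentication; the sketch stops at building the request and decoding the reply so that it runs without a server.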
The main assets of the DEB development platform can be characterized by the following points:
– All the data are stored on the server and a considerable part of the functionality is also implemented
on the server, while the client application can be very lightweight.
– Very good tools for (remote) team cooperation; data modifications are immediately seen by all
the users. The server also provides authentication and authorization tools.
– The server may offer different interfaces using the same data structure. These interfaces can be
reused by many client applications.
– Homogeneity of the data structure and presentation. If an administrator commits a change in the
data presentation, this change will automatically appear in every instance of the client software.
– Integration with external applications.
2.1 Linked Data
The term Linked Data refers to a methodology for publishing and interlinking structured data online.
This methodology was proposed by Berners-Lee in 2006 (Berners-Lee, 2006; Bizer et al., 2009), who
outlined four rules that data must meet for easy sharing and interconnecting:
1. objects are identified by a URI⁵ (e.g. http://dbpedia.org/page/Brno),
2. URI identifiers are HTTP links, where people or software tools can access the data,
3. useful information is provided at the given URI, using the appropriate standards (like RDF) (the
previously mentioned page contains links to the same information in multiple formats; RDF is
provided at http://dbpedia.org/data/Brno.rdf),
4. other objects are referenced using their URIs to get more information (e.g. a link from
Brno.rdf to http://dbpedia.org/resource/South_Moravian_Region).
All resources stored in the DEB platform can be published using the Linked Data methodology.
The DEB platform provides the tools for Linked Data presentation, and the decision on how to release
the data lies with the author. Linked Data requirements are satisfied in the following manner:
³ http://www.w3.org/TR/2007/REC-soap12-part0-20070427/
⁴ http://www.json.org/xml.html
⁵ Uniform resource identifier (Berners-Lee et al., 2005)
1. use URIs as names – each entry has a unique URI identifier,
2. use HTTP URIs – through the DEB platform API, entries are accessible at an HTTP URI,
3. provide useful information using standards – when linking to an entry URI, the data are displayed
either in raw XML format, or converted to RDF or another defined format,
4. link to other URIs – the DEB platform enables linking to other resources if provided by the data
author.
These requirements are fully embraced in the DEB-based projects DEBVisDic (Horák et al., 2006) and
KYOTO (Horák and Rambousek, 2010, 2009), where all the information was released
as Linked Data.
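The four points above can be made concrete with a small sketch that turns one XML entry into RDF triples in the N-Triples syntax. This is only an illustration of the general idea: the base URI, the vocabulary namespace, and the element names (`entry`, `headword`, `seealso`) are hypothetical and do not reproduce the conversion used by the DEB platform.

```python
import xml.etree.ElementTree as ET

# Hypothetical base URI for entries and namespace for properties.
BASE = "http://example.org/dict/entry/"
VOCAB = "http://example.org/dict/vocab#"

ENTRY_XML = """<entry id="e42">
  <headword>fresco</headword>
  <seealso ref="e7"/>
</entry>"""


def entry_to_ntriples(xml_text: str) -> list:
    """Convert one entry to N-Triples lines, one triple per child element."""
    entry = ET.fromstring(xml_text)
    # Rules 1 and 2: the entry is named by an HTTP URI.
    subject = "<" + BASE + entry.get("id") + ">"
    triples = []
    for child in entry:
        predicate = "<" + VOCAB + child.tag + ">"
        ref = child.get("ref")
        if ref is not None:
            # Rule 4: a cross-reference becomes a link to another URI.
            obj = "<" + BASE + ref + ">"
        else:
            # Rule 3: plain content becomes an escaped literal.
            text = (child.text or "").strip()
            obj = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


for line in entry_to_ntriples(ENTRY_XML):
    print(line)
```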
Berners-Lee later published a rating system for the distributed data, while expanding the term
Linked Data to Linked Open Data – which means Linked Data that are released under an open
licence. This rating system is aimed especially at government agencies to encourage them to publish
valuable (and reusable) information. The importance of Linked Open Data is acknowledged for
example by the European Union, funding projects like LOD2 (a large integrating project to develop
tools, standards and management methods for Linked Open Data) or Open Data Portal (a catalogue
of data available for reuse). The rating system follows these principles:
– 1 star – the data are available on the web in any format, with an open licence.
– 2 stars – the data are published in a machine-readable structured format.
– 3 stars – the data use a non-proprietary format.
– 4 stars – W3C open standards (RDF and SPARQL) are used to identify objects for linking.
– 5 stars – the data contain links to other resources to give context.
The DEB platform offers full support to the dictionary publisher to disseminate the dictionary
content as Linked Open Data:
1. published online with an open licence – this has to be decided by the data authors, but the DEB
platform enables releasing data on the web.
2. available as machine-readable structured data – documents in the DEB platform are stored in
an XML format which is machine-readable.
3. non-proprietary format – XML is a standardized format.
4. use open standards from W3C (RDF and SPARQL) – the XML format itself is a W3C standard,
but to conform with this requirement more precisely, documents are converted to RDF format.
5. link to other resources – the DEB platform enables interlinking with other resources.
As demonstrated, the only limitation is the decision of the data authors regarding the licensing.
When this is resolved, the DEB platform makes it possible to publish all documents as Linked Open Data.
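The rating scheme can be read as a checklist in which each star builds on the previous ones. The following Python sketch encodes that reading; the class and field names are invented for this example, and treating the levels as strictly cumulative is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of a published dataset, one flag per star."""
    on_web_open_licence: bool
    machine_readable: bool
    non_proprietary: bool
    uses_w3c_standards: bool
    links_to_others: bool


def star_rating(d: Dataset) -> int:
    """Count stars in order, stopping at the first unmet criterion."""
    criteria = [
        d.on_web_open_licence,
        d.machine_readable,
        d.non_proprietary,
        d.uses_w3c_standards,
        d.links_to_others,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# XML published openly, but not converted to RDF: three stars.
xml_only = Dataset(True, True, True, False, True)
print(star_rating(xml_only))
```

Under this reading, a dictionary whose authors have not chosen an open licence receives no stars regardless of its format, which mirrors the point that licensing is the only limitation left to the data authors.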
Figure 1: Setting the entry structure.
3. The DEBWrite application
The DEBWrite application is implemented as a multi-platform web application, utilizing HTML5
and JavaScript standards⁶ that allow full interoperability and dynamic adaptations to current dictionary
interfaces. The DEBWrite application allows users to create and share a new dictionary
without any complicated configuration or advanced technical skills. Based on experience with dictionaries
in the DEB platform, a default entry structure is proposed that fits many dictionaries
(also with terminological dictionaries in mind). Each entry is composed of top-level information
(headword and its variants, grammatical information, domain/category) and any number of
meanings (each containing an explanation and usage examples). Translations to various languages,
cross-references to other entries (with relation type), collocations, and external references may be
included on the entry level or meaning level. Within the dictionary definition form, users may alter
the entry structure in a graphical interface (see Figure 1) by deleting unnecessary information,
adding new entry fields, changing labels, or altering the option lists (relation types, languages for
translations, domains...).
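To make the default structure easier to picture, the sketch below builds one entry as an XML tree with top-level information and a list of meanings. The element names (`entry`, `headword`, `pos`, `domain`, `meaning`, `explanation`, `example`) are chosen for this illustration only; the XML format that DEBWrite uses internally is not described here and may differ.

```python
import xml.etree.ElementTree as ET


def make_entry(headword, pos, domain, meanings):
    """Build one entry: top-level information plus any number of meanings.

    `meanings` is a list of (explanation, [usage examples]) pairs.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = headword
    ET.SubElement(entry, "pos").text = pos
    ET.SubElement(entry, "domain").text = domain
    for explanation, examples in meanings:
        meaning = ET.SubElement(entry, "meaning")
        ET.SubElement(meaning, "explanation").text = explanation
        for example in examples:
            ET.SubElement(meaning, "example").text = example
    return entry


entry = make_entry(
    "fresco",
    "noun",
    "painting",
    [("a painting done on wet plaster", ["a Renaissance fresco"])],
)
print(ET.tostring(entry, encoding="unicode"))
```

Translations, cross-references, and collocations could be added in the same way, either under `entry` or under an individual `meaning`, matching the two levels mentioned above.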
According to the updated entry structure, the editing form and the public browser are generated
automatically. See Figure 2 for an example of the editing form. The dictionary website design is fully
customizable via CSS stylesheets or templates that are used for output generation. XSLT templates
are used as the default option; however, the HandlebarsJS template engine⁷ is also being evaluated. Based on
user feedback, the preferred template engine might be changed in future DEBWrite updates.
The authors may either edit the source code of the output generating files, or select some of the
variables (e.g. colours and font styles) in the graphical interface (see Figure 3). In future versions,
a more detailed graphical interface to change the output layout will be added. Each dictionary may
use multiple output templates to provide different dictionary previews based on user settings.
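The idea of several output templates per dictionary can be sketched as follows. Python's `string.Template` is used here purely as a stand-in for the XSLT or HandlebarsJS templates mentioned above; the view names `full` and `brief`, the element names, and the HTML markup are assumptions of this example.

```python
import xml.etree.ElementTree as ET
from html import escape
from string import Template

# Two hypothetical output templates for the same entry data.
TEMPLATES = {
    "full": Template('<div class="entry"><b>$headword</b> <i>$pos</i>: $explanation</div>'),
    "brief": Template('<span class="entry">$headword</span>'),
}

ENTRY_XML = (
    "<entry><headword>fresco</headword><pos>noun</pos>"
    "<explanation>painting on wet plaster</explanation></entry>"
)


def render(entry_xml: str, view: str) -> str:
    """Fill the template selected by `view` with escaped entry fields."""
    entry = ET.fromstring(entry_xml)
    fields = {child.tag: escape(child.text or "") for child in entry}
    # safe_substitute leaves unknown placeholders in place instead of raising.
    return TEMPLATES[view].safe_substitute(fields)


print(render(ENTRY_XML, "full"))
print(render(ENTRY_XML, "brief"))
```

Keeping the entry data separate from the templates is what allows one dictionary to offer different previews: only the template selected by the user setting changes, not the stored entry.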
⁶ With the jQuery (https://jquery.com/) and jQuery UI (https://jqueryui.com/) libraries.
⁷ http://handlebarsjs.com/
Figure 2: Example of the editing form automatically generated from the settings.
The DEBWrite dictionary editor also supports the upload of multimedia attachments (e.g. large figures,
audio or video recordings) to supplement the entries. The authors need to specify a special field type
in the entry structure for file uploads. The server detects the attachment type (e.g. image, video,
audio) and displays the multimedia content in an appropriate form for the output. See Figure 4 for
an example of multimedia file upload and output.
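One simple way to pick an output form for an attachment is to guess its media type from the file name, as in the sketch below. How the DEBWrite server actually detects the type is not specified in this paper; guessing from the extension with Python's standard `mimetypes` module and the HTML produced here are assumptions of the example.

```python
import mimetypes
from html import escape


def attachment_kind(filename: str) -> str:
    """Classify a file as image, video, audio, or other by its guessed MIME type."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "other"
    major = mime.split("/", 1)[0]
    return major if major in {"image", "video", "audio"} else "other"


def render_attachment(filename: str) -> str:
    """Return an HTML fragment suited to the attachment kind."""
    kind = attachment_kind(filename)
    src = escape(filename, quote=True)
    if kind == "image":
        return f'<img src="{src}">'
    if kind == "video":
        return f'<video src="{src}" controls></video>'
    if kind == "audio":
        return f'<audio src="{src}" controls></audio>'
    # Unknown types fall back to a plain download link.
    return f'<a href="{src}">{src}</a>'


for name in ["figure.png", "sign.mp4", "word.mp3"]:
    print(attachment_kind(name), render_attachment(name))
```

A production server would more likely inspect the file content as well, since an extension can be missing or misleading.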
When the lexicographers have some information prepared in advance, DEBWrite can simplify
the start of the dictionary creation process. In one common scenario,
DEBWrite imports a list of headwords and automatically creates corresponding empty entries prepared
for expert editing. Another scenario involves moving rich existing
structured data to DEBWrite. In such cases, DEBWrite can import a full dictionary (or a part of it) in
the XML format. As of now, the imported file must follow the XML structure used internally by the DEBWrite
application. However, a conversion between different (compatible) XML structures is a
matter of applying an XSLT template. Future versions of DEBWrite will also support the
import of data in a custom XML format.
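The headword-list scenario can be sketched in a few lines: each non-empty line of the input becomes an empty entry ready for editing. The element names, the generated identifiers (`e1`, `e2`, ...), and the decision to skip duplicate headwords are assumptions of this illustration, not documented DEBWrite behaviour.

```python
import xml.etree.ElementTree as ET


def entries_from_headwords(lines):
    """Create one empty entry per distinct, non-blank headword."""
    seen = set()
    entries = []
    for line in lines:
        headword = line.strip()
        if not headword or headword in seen:
            continue  # skip blank lines and repeated headwords
        seen.add(headword)
        entry = ET.Element("entry", id=f"e{len(entries) + 1}")
        ET.SubElement(entry, "headword").text = headword
        # An empty meaning is left as a placeholder for the editor to fill in.
        ET.SubElement(entry, "meaning")
        entries.append(entry)
    return entries


imported = entries_from_headwords(["fresco\n", "\n", "tempera\n", "fresco\n"])
for e in imported:
    print(ET.tostring(e, encoding="unicode"))
```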
The application also supports export to a standard XML file. Preprocessed XSLT templates are
included to convert the dictionary data into an HTML format for online publishing. For a printed
or electronic edition in PDF, the data are converted to LaTeX and subsequently to PDF format.
To make lexicographic resources easier to share and re-use, DEBWrite also provides
the data in a form compliant with the Linked Data methodology (see section 2.1). The
decision about the data licensing and access control lies entirely with the dictionary authors; however,
DEBWrite provides the tools needed to make the sharing easy.
Figure 3: Example of output design customizations.
Figure 4: Output representation of various media attachment types.
One of the major advantages of the DEBWrite application lies in its support of team cooperation
in the dictionary preparation process. DEBWrite classifies authorized users into one of three possible
user roles: a manager, an editor, or a reader (see Figure 5 for an example of user access management).
– The user who created the dictionary is the dictionary manager. Managers may alter any dictionary
settings. They may grant access to the dictionary to other users, specifying their role.
Managers are able to edit all the dictionary entries and set an entry for publication. The manager
may also decide to make published entries publicly available, which means that no password is
needed to browse the dictionary (this might be regarded as a fourth user role in the dictionary
access management).
– An editor may edit entries before they are set to be published.
– Readers may browse and navigate through the published entries and their attachments with
advanced search capabilities.
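The role rules above can be summarized as two small permission checks. This is a sketch under stated assumptions: the function and parameter names are invented, an anonymous visitor is represented by `None`, and allowing editors to read unpublished entries is an assumption, since the text only states what editors may edit.

```python
from enum import Enum


class Role(Enum):
    MANAGER = "manager"
    EDITOR = "editor"
    READER = "reader"


def can_edit(role, entry_published: bool) -> bool:
    """Managers edit everything; editors only entries not yet published."""
    if role is Role.MANAGER:
        return True
    if role is Role.EDITOR:
        return not entry_published
    return False


def can_read(role, entry_published: bool, dictionary_public: bool) -> bool:
    """Readers see published entries; anonymous users need a public dictionary."""
    if role in (Role.MANAGER, Role.EDITOR):
        return True  # assumed: the editing team can see work in progress
    if role is Role.READER:
        return entry_published
    # role is None: the implicit "fourth role" of an anonymous visitor
    return entry_published and dictionary_public


print(can_edit(Role.EDITOR, entry_published=True))
print(can_read(None, entry_published=True, dictionary_public=True))
```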
Figure 5: User access management.
4. Conclusions
We have introduced a new customizable and freely available dictionary writing system named DEBWrite.
The application prototype is currently in public testing, available at http://deb.fi.muni.cz/debwrite.
As a part of testing, the Terminological Dictionary of Fine Arts was converted to
DEBWrite from the original application (where the editing form functionality was originally limited
to the Firefox browser only), allowing multi-platform editing and providing better user experience.
5. Acknowledgements
This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin
project LM2010013. The research leading to these results has received funding from the Norwegian
Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project
Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047.
6. References
Berners-Lee, T. (2006). Design Issues: Linked Data.
Berners-Lee, T., Fielding, R. & Masinter, L. (2005). Uniform Resource Identifier (URI): Generic
Syntax. STD 66 (INTERNET STANDARD).
Bizer, C., Heath, T. & Berners-Lee, T. (2009). Linked Data - The Story So Far. International Journal
on Semantic Web and Information Systems (IJSWIS), 5(3), pp. 1–22.
Fomichev, A., Grinev, M. & Kuznetsov, S. (2006). Sedna: A Native XML DBMS. Lecture Notes in
Computer Science, 3831:272.
Hanks, P., Coates, R. & McClure, P. (2011). Methods for Studying the Origins and History of Family
Names in Britain. In Facts and Findings on Personal Names: Some European Examples,
Uppsala. Acta Academiae Regiae Scientiarum Upsaliensis, pp. 37–58.
Horák, A., Pala, K., Rambousek, A. & Povolný, M. (2006). DEBVisDic – First Version of New
Client-Server Wordnet Browsing and Editing Tool. In Proceedings of the Third International
WordNet Conference - GWC 2006, Jeju, South Korea. Masaryk University, Brno, pp. 325–328.
Horák, A. & Rambousek, A. (2007). DEB Platform Deployment – Current Applications. In
RASLAN 2007: Recent Advances in Slavonic Natural Language Processing, Brno, Czech Republic.
Masaryk University, pp. 3–11.
Horák, A. & Rambousek, A. (2009). Using Wordnets and Ontologies for Text-Meaning Assignment
- Implementation Details of the KYOTO Project First Phase. In Proceedings of the 4th
International Conference on Software and Data Technologies, Volume 2, Portugal. INSTICC,
pp. 303–307.
Horák, A. & Rambousek, A. (2010). Using DEB Services for Knowledge Representation within
the KYOTO Project. In Principles, Construction and Application of Multilingual WordNets,
Proceedings of the Fifth Global WordNet Conference, New Delhi, India. Narosa Publishing
House, pp. 165–170.
Horák, A. & Rambousek, A. (2013). PRALED – A New Kind of Lexicographic Workstation. In
Przepiórkowski, A., Piasecki, M., Jassem, K. & Fuglewicz, P., editors, Computational Linguistics:
Applications, Springer, pp. 131–141.
Horák, A., Vossen, P. & Rambousek, A. (2008). A Distributed Database System for Developing
Ontological and Lexical Resources in Harmony. In Lecture Notes in Computer Science:
Computational Linguistics and Intelligent Text Processing, Haifa, Israel. Springer-Verlag, pp.
1–15.
IDM DPS (2006). IDM Dictionary Production System.
http://www.idm.fr/products/dictionary_writing_system.
Joffe, D. & de Schryver, G.-M. (2004). TshwaneLex – Professional off-the-shelf lexicography software.
In Third International Workshop on Dictionary Writing Systems: Program and List
of Accepted Abstracts, Brno, Czech Republic. Masaryk University, Faculty of Informatics.
http://tshwanedje.com/tshwanelex/.
Maarouf, I. E., Bradbury, J., Baisa, V. & Hanks, P. (2014). Disambiguating verbs by collocation:
Corpus lexicography meets natural language processing. In Calzolari, N., Choukri, K., Declerck,
T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J. & Piperidis, S.,
editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation
(LREC’14), Reykjavik, Iceland. European Language Resources Association (ELRA).
Moskovitz, D. (2004). Mātāpuna Dictionary Database System. In Third International Workshop
on Dictionary Writing Systems: Program and List of Accepted Abstracts, Brno, Czech Republic.
Masaryk University, Faculty of Informatics. http://matapuna.thinktank.co.nz/matapuna/.
Rambousek, A. & Horák, A. (2015). Management and Publishing of Multimedia Dictionary of the
Czech Sign Language. In Biemann, C., Handschuh, S., Freitas, A., Meziane, F. & Métais, E.,
editors, Natural Language Processing and Information Systems, NLDB 2015, Lecture Notes
in Computer Science, Springer, pp. 399–403.
This work is licensed under the Creative Commons Attribution ShareAlike 4.0 International License.
http://creativecommons.org/licenses/by-sa/4.0/