PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
21 November 2018

Karel Pala, Adam Rambousek PA153 NLP Computational Lexicography 1/19

1. Lexicography
• Introduction
• History
• Dictionaries and computers
2. Computational Lexicography
• Data representation
• TEI
• LMF
• Dictionary Writing Systems
3. Dictionary creation
• Lexical database
• Dictionary

Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography (Czech: lexikografie)
► the activity or occupation of compiling dictionaries (Oxford d.)
► the editing or making of a dictionary (Merriam-Webster d.)
► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person."

History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
► Sumerian - Ebla language
• The Oxford English Dictionary (A New English Dictionary)
► 1857, Philological Society, R. C. Trench criticizes existing dictionaries
► 1879, James A. H. Murray appointed chief editor
► 1882-1928, published in 12 volumes, 15 487 pages, 240 000 entries

History
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
► volunteers gathering supporting materials
► excerpts from novels, poems, technical books, journals
► Příruční slovník jazyka českého, 1935-1957
► 10 824 pages, 250 000 entries
► quotes by "unwanted authors" censored (Karel Čapek = Lid. nov.)

Future?
• Akademický slovník současné češtiny
► 2005-2010, lexical database (Praled)
► 2012-2016, applied research
► planned 120-150 thousand entries
► letter A finished (2700 entries) in December 2017; B+C in 2018?
► mainly electronic (web, mobile)
• The Oxford English Dictionary 3rd Edition
► 2000-2037?, budget £34M
► "Every word in the Dictionary is being reviewed"
► periodic updates in batches, 4 times a year

Dictionaries and computers
• 1960s - computers are used, lexicographers writing on paper, operators typing into database, Brown Corpus
• 1978, Longman Dictionary of Contemporary English
► first with a limited defining vocabulary, checked automatically
► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
► contemporary corpus (Bank of English)
► 1987, Collins COBUILD English Language Dictionary
► first dictionary based on corpus data
► new definition style - full sentence
► "If a person, animal, or other living thing is killed, something or someone causes them to die."
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative

XML
• PB138 Modern Markup Languages
• Extensible Markup Language - markup (meta)language
• rules for a well-formed document - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements: <element>content</element>
• elements without content may be shortened to <element/>
• attributes: <element attribute="value">content</element>

Structure and content description
• DTD (Document Type Definition)
► list of elements and attributes, and their relations
► no content checking
• XML Schema (XSD, XML Schema Definition)
► description of XML document structure and content, schema itself is an XML document
► elements, attributes, structure
► possibility to define custom content types (e.g. postal address)
► content checking (e.g. number range, regular expressions, allowed values)

Display
• XSLT - Extensible Stylesheet Language Transformations
• converting XML to another format
► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of the XML document, recursive processing of the document
• (functional programming language)

[Figure: the entry "lov" as printed in two Czech dictionaries, Slovník spisovného jazyka českého (SSJČ, long entry with three senses and many examples) and Slovník spisovné češtiny (short entry with two senses)]

Storing
• XML database
• storing XML documents directly
• searching - XPath, XQuery
• e.g. eXist, BaseX, Sedna

TEI
• Text Encoding Initiative, http://www.tei-c.org/
• TEI Guidelines (current version P5, published 2007)
• XML format for semantic description of text documents
• wide range of markup tags
• TEI Lite - smaller version, "90% of the needs of 90% of users"
• novels, poems, theatre plays, technical reference, dictionaries, corpora, alignment, text revisions, musical notation...
• tools - XSL transformations to LaTeX, docx, epub, HTML

LMF
• Lexical Markup Framework, http://www.lexicalmarkupframework.org/
• ISO-24613:2008
• common model for lexical resources
• emphasis on machine processing and extensibility
• UML diagram for the lexicon
• core with basic information + extensions for various areas (morphology, syntax, semantics...)

[Figure: UML instance diagram of an LMF lexicon]
Lexical Resource
  Global Information: languageCoding = "ISO 639-3"
  Lexicon: language = "eng"
    Lexical Entry: partOfSpeech = "commonNoun"
      Lemma: writtenForm = "clergyman"
      Word Form: writtenForm = "clergyman", grammaticalNumber = "singular"
      Word Form: writtenForm = "clergymen", grammaticalNumber = "plural"

Dictionary Writing Systems
• software application for dictionary creation (usually full process)
• connected to other resources (corpora, analyzers...)
• often custom developed
• commercial (IDM DPS, iLex, TLex, ABBYY Lingvo Content)
• DEB (Dictionary Editor and Browser)
► platform to build dictionary applications
► client-server, core libraries, specialized modules
► DEBDict, DEBVisDic, Internetová jazyková příručka, DEBWrite
► http://deb.fi.muni.cz

[Figure: screenshot of a dictionary writing system editing a French-English dictionary: entry list, entry structure tree, and formatted preview of entries such as "sans" and "santé"]

Lexical database
• detailed structured database of language
► (recently) usage examples from corpus
► grammar
► valences, patterns
► language style, usage, region...
► word relations
• foundation for dictionaries and research
• PraLeD (Pražská Lexikální Databáze)
• DANTE (Database of ANalysed Texts of English)

Dictionary creation
• dictionary writing is expensive, laborious and time-consuming, competition
• B. T. Sue Atkins, Michael Rundell: The Oxford Guide to Practical Lexicography

[Figure: dictionary project workflow diagram. Boxes: Marketing Dept, Editorial Dept, IT Dept, Design Dept, Software House. Labels: user profiles, extent/contents, styles & sample entries, develop dictionary, print design]

Dictionary content
• macrostructure - entry list (+preface, appendices...)
• heslo1 = lemma, entry term, heslové slovo, headword
► noun singular, verb infinitive
► word parts, collocations
• heslo2 = heslová stať, entry
• microstructure - structure of one entry in the dictionary
► checked by editing software
► easier orientation for the reader

Electronic dictionaries
• more information (CD, DVD, web)
► presentation space
• multimedia, searching, navigation, updates, external links
• data mining of user information
► Dictionary.com, subsequent searches: bastion, hiatus, enmity, decorous
• display information based on user profile
• connection with corpora - ordnet.dk, DWDS.de...
• combining resources, downloading data - Wordnik.com
• user-created content (90-9-1) - Wiktionary, slovnik.zcu.cz...
• Macmillan - switch to digital only
• shift from products to services
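
Example: checking entry microstructure

The Dictionary content slide notes that the microstructure of an entry is checked by editing software, and the Storing slide mentions searching XML with XPath. The following is a minimal sketch of both ideas in Python, using only the standard library. The element names (dictionary, entry, headword, pos, form) and the rule about which elements are required are hypothetical, chosen for illustration; they are not TEI, LMF, or the markup of any particular dictionary writing system. The sample word forms (clergyman, clergymen) are taken from the LMF slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary markup (illustrative element names, not TEI or LMF).
DICTIONARY_XML = """
<dictionary>
  <entry id="clergyman">
    <headword>clergyman</headword>
    <pos>commonNoun</pos>
    <form number="singular">clergyman</form>
    <form number="plural">clergymen</form>
  </entry>
  <entry id="incomplete">
    <headword>lov</headword>
  </entry>
</dictionary>
"""

# Assumed microstructure rule: every entry needs these non-empty elements
# and at least one <form>.
REQUIRED_ELEMENTS = ("headword", "pos")


def check_entry(entry):
    """Return a list of microstructure problems found in one <entry>."""
    problems = []
    for tag in REQUIRED_ELEMENTS:
        element = entry.find(tag)
        if element is None or not (element.text or "").strip():
            problems.append(f"missing <{tag}>")
    if not entry.findall("form"):
        problems.append("no <form> elements")
    return problems


root = ET.fromstring(DICTIONARY_XML)

# Microstructure check: map each entry id to its list of problems.
report = {entry.get("id"): check_entry(entry) for entry in root.findall("entry")}

# XPath-style search: all plural word forms in the dictionary.
plural_forms = [form.text for form in root.findall(".//form[@number='plural']")]

print(report)
print(plural_forms)
```

ElementTree supports only a limited subset of XPath; a native XML database such as those listed on the Storing slide would typically be queried with full XPath or XQuery instead, and a production system would more likely validate entries against a DTD or XML Schema than with hand-written checks.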