PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
16 November 2015

• Lexicography
  ► Introduction
  ► History
  ► Dictionaries and computers
• Computational Lexicography
  ► Data representation
  ► TEI
  ► LMF
  ► Dictionary Writing Systems
• Dictionary creation
  ► Lexical database
  ► Dictionary

Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography (Czech: lexikografie)
  ► the activity or occupation of compiling dictionaries (Oxford d.)
  ► the editing or making of a dictionary (Merriam-Webster d.)
  ► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person." (Slovník národního jazyka náleží mezi první potřebnosti vzdělaného člověka.)

History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
  ► Sumerian - Ebla language
• The Oxford English Dictionary (A New English Dictionary)
  ► 1857, Philological Society, R. C. Trench criticizes existing dictionaries
  ► 1879, James A. H. Murray appointed chief editor
  ► 1882-1928, published in 12 volumes, 15 487 pages, 240 000 entries

History
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
  ► volunteers gathering supporting materials
  ► excerpts from novels, poems, technical books, journals
  ► Příruční slovník jazyka českého, 1935-1957
  ► 10 824 pages, 250 000 entries
  ► quotes by "unwanted authors" censored (Karel Čapek = Lid. nov.)

Future?
• Akademický slovník současné češtiny
  ► 2005-2010, lexical database (PraLeD)
  ► 2012-2016, applied research
  ► planned 120-150 thousand entries
  ► letter A finished (2700), to be published in December; B, C in 2017
  ► mainly electronic (web, mobile)

Dictionaries and computers
• 1960s - computers come into use: lexicographers write on paper, operators type into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
  ► first with a limited defining vocabulary, checked automatically
  ► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
  ► contemporary corpus (Bank of English)
  ► 1987, Collins COBUILD English Language Dictionary
  ► first dictionary based on corpus data
  ► new definition style - full sentence
  ► "If a person, animal, or other living thing is killed, something or someone causes them to die."
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative

XML
• PB138 Modern Markup Languages
• Extensible Markup Language - markup (meta)language
• rules for a properly formatted document - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements: <element>content</element>
• elements without content may be shortened to <element/>
• attributes: <element attribute="value">

Structure and content description
• DTD (Document Type Definition)
  ► list of elements and attributes, and their relations
  ► no content checking
• XML Schema (XSD, XML Schema Definition)
  ► description of XML document structure and content; the schema itself is an XML document
  ► elements, attributes, structure
  ► possibility to define custom content types (e.g. postal address)
  ► content checking (e.g.
    number range, regular expressions, allowed values)

Display
• XSLT - Extensible Stylesheet Language Transformations
• converting XML to another format
  ► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of the XML document, recursive processing of the document
• (functional programming language)

[Figure: the entry "lov" rendered from two Czech dictionaries]

Storing
• XML database
• storing XML documents directly
• searching - XPath, XQuery
• e.g. eXist, BaseX, Sedna

TEI
• Text Encoding Initiative, http://www.tei-c.org/
• TEI Guidelines (current version 5, published 2007)
• XML format for semantic description of text documents
• wide range of markup tags
• TEI Lite - smaller version, "90% of the needs of 90% of users"
• novels, poems, theatre plays, technical reference, dictionaries, corpora, alignment, text revisions, musical notation...
• tools - XSL transformations to LaTeX, docx, epub, HTML

LMF
• Lexical Markup Framework, http://www.lexicalmarkupframework.org/
• ISO 24613:2008
• common model for lexical resources
• emphasis on machine processing and extensibility
• UML diagram for the lexicon
• core with basic information + extensions for various areas (morphology, syntax, semantics...)

[Figure: LMF UML instance diagram]
  Lexical Resource
    Global Information: languageCoding = "ISO 639-3"
    Lexicon: language = "eng"
      Lexical Entry: partOfSpeech = "commonNoun"
        Lemma: writtenForm = "clergyman"
        Word Form: writtenForm = "clergyman", grammaticalNumber = "singular"
        Word Form: writtenForm = "clergymen", grammaticalNumber = "plural"

Dictionary Writing Systems
• software application for dictionary creation (usually the full process)
• connected to other resources (corpora, analyzers...)
• often custom developed
• commercial (IDM DPS, iLex, TLex, ABBYY Lingvo Content)
• DEB (Dictionary Editor and Browser)
  ► platform to build dictionary applications
  ► client-server, core libraries, specialized modules
  ► DEBDict, DEBVisDic, Internetová jazyková příručka, DEBWrite
  ► http://deb.fi.muni.cz

[Figure: screenshot of a dictionary writing system showing a French headword list, the entry structure tree, and a formatted preview of the French-English entry "sans"]
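
The instance diagram on the LMF slide (lexical resource, lexicon, lexical entry, lemma, word forms) can be sketched as plain Python data classes. This is only a minimal illustration under assumptions, not the ISO 24613 data model itself: the class and field names are simplified Python renderings of the labels in the diagram, and `lemma_of` is a hypothetical helper added for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordForm:
    written_form: str
    grammatical_number: str


@dataclass
class LexicalEntry:
    lemma: str            # corresponds to Lemma.writtenForm in the diagram
    part_of_speech: str
    word_forms: List[WordForm] = field(default_factory=list)


@dataclass
class Lexicon:
    language: str
    entries: List[LexicalEntry] = field(default_factory=list)

    def lemma_of(self, written_form: str) -> Optional[str]:
        """Hypothetical helper: return the lemma of the entry listing this form."""
        for entry in self.entries:
            if any(wf.written_form == written_form for wf in entry.word_forms):
                return entry.lemma
        return None


@dataclass
class LexicalResource:
    language_coding: str  # corresponds to Global Information in the diagram
    lexicons: List[Lexicon] = field(default_factory=list)


# The values below are the ones readable in the slide's diagram.
resource = LexicalResource(
    language_coding="ISO 639-3",
    lexicons=[
        Lexicon(
            language="eng",
            entries=[
                LexicalEntry(
                    lemma="clergyman",
                    part_of_speech="commonNoun",
                    word_forms=[
                        WordForm("clergyman", "singular"),
                        WordForm("clergymen", "plural"),
                    ],
                )
            ],
        )
    ],
)

english = resource.lexicons[0]
print(english.lemma_of("clergymen"))  # prints "clergyman"
```

Keeping the inflected forms inside the entry, rather than as separate entries, mirrors the core idea of the diagram: one lexical entry groups a lemma with all of its word forms.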

Lexical database
• detailed structured database of language
  ► (recently) usage examples from corpus
  ► grammar
  ► valences, patterns
  ► language style, usage, region...
  ► word relations
• foundation for dictionaries and research
• PraLeD (Pražská Lexikální Databáze)
• DANTE (Database of ANalysed Texts of English)

Dictionary creation
• dictionary writing is expensive, laborious and time-consuming; competition
• B. T. Sue Atkins, Michael Rundell: The Oxford Guide to Practical Lexicography

[Figure: dictionary production workflow across departments (Marketing, Editorial, IT, Design, Software House); recoverable labels: user profiles, extent/contents, styles & sample entries, develop dictionary, print design]

Dictionary content
• macrostructure - entry list (+ preface, appendices...)
• heslo (1) = lemma, entry term, heslové slovo, headword
  ► noun singular, verb infinitive
  ► word parts, collocations
• heslo (2) = heslová stať, entry
• microstructure - structure of one entry in the dictionary
  ► checked by editing software
  ► easier orientation for the reader

Electronic dictionaries
• more information (CD, DVD, web)
  ► presentation space
• multimedia, searching, navigation, updates
• longer descriptions, links to further resources
• display information based on user profile
• connection with corpora - ordnet.dk, DWDS.de...
• combining resources, downloading data - Wordnik.com
• user-created content (90-9-1) - Wiktionary, slovnik.zcu.cz...
• Macmillan - switch to digital only
• OED3 - 2000 to 2037, periodic updates
• shift from products to services
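
The "Dictionary content" slide notes that the microstructure of an entry is checked by editing software. A minimal sketch of such a check, using only the Python standard library, might look as follows. Everything specific here is an assumption made for illustration: the element names (`entry`, `headword`, `pos`, `sense`, `def`), the list of required parts, the sample entry with its invented definitions, and the helper `check_entry`. A real dictionary writing system would more likely validate entries against a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical microstructure: every entry needs these child elements.
REQUIRED_CHILDREN = ["headword", "pos", "sense"]

# Illustrative entry; the markup and the definitions are made up.
ENTRY_XML = """
<entry id="clergyman">
  <headword>clergyman</headword>
  <pos>noun</pos>
  <sense n="1"><def>a male member of the clergy</def></sense>
  <sense n="2"><def>illustrative second sense</def></sense>
</entry>
"""


def check_entry(entry: ET.Element) -> List[str]:
    """Return a list of microstructure problems (empty if the entry passes)."""
    problems = []
    # Structural check: required parts of the entry must be present.
    for name in REQUIRED_CHILDREN:
        if entry.find(name) is None:
            problems.append(f"missing <{name}>")
    # Content check: every sense must carry a non-empty definition.
    for sense in entry.findall("sense"):
        definition = sense.find("def")
        if definition is None or not (definition.text or "").strip():
            problems.append(f"sense {sense.get('n', '?')} has no definition")
    return problems


entry = ET.fromstring(ENTRY_XML)
print(check_entry(entry))  # prints [] because the sample entry is complete
```

The `find` and `findall` calls use the limited XPath subset supported by `xml.etree.ElementTree`; an XML database such as those named on the "Storing" slide would offer full XPath and XQuery instead.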