PA153 Natural Language Processing
08 - Lexicographic tools and computational lexicography
Karel Pala, Adam Rambousek
Centrum ZPJ, FI MU, Brno
December 16, 2020

• Lexicography
  ► Introduction
  ► History
  ► Dictionaries and computers
• Computational Lexicography
  ► Data representation
  ► TEI
  ► LMF
  ► Dictionary Writing Systems
• Dictionary creation
  ► Lexical database
  ► Dictionary

Lexicography
• PLIN035 Computational Lexicography
• subfield of lexicology
• lexicography (Czech: lexikografie)
  ► the activity or occupation of compiling dictionaries (Oxford d.)
  ► the editing or making of a dictionary (Merriam-Webster d.)
  ► the job of writing a dictionary (Macmillan d.)
• practical lexicography
• theoretical lexicography - analysis and description of the lexicon, theory of dictionary components, user groups, evaluation
• "A dictionary of the national language is among the first necessities of an educated person."

History
• Ebla (Syria) clay tablets, ca. 2500-2250 BC
  ► Sumerian - Ebla language
• The Oxford English Dictionary (A New English Dictionary)
  ► 1857, Philological Society, R. C. Trench criticizes existing dictionaries
  ► 1879, James A. H. Murray appointed chief editor
  ► 1882-1928, published in 12 volumes, 15 487 pages, 240 000 entries

History
• Kancelář Slovníku jazyka českého (Office of the Dictionary of the Czech Language), 1911
  ► volunteers gathering supporting materials
  ► excerpts from novels, poems, technical books, journals
  ► Příruční slovník jazyka českého (Reference Dictionary of the Czech Language), 1935-1957
  ► 10 824 pages, 250 000 entries
  ► quotes by "unwanted authors" censored (Karel Čapek = Lid. nov.)

Future?
• Akademický slovník současné češtiny (Academic Dictionary of Contemporary Czech)
  ► 2005-2010, lexical database (Praled)
  ► 2012-2016, applied research
  ► planned 120-150 thousand entries
  ► finished A (2700), B (3500), C+Č (3600), as of December 2020
  ► mainly electronic (web, mobile)
  ► slovnikcestiny.cz
• The Oxford English Dictionary 3rd Edition
  ► 2000-2037?, budget £34M
  ► "Every word in the Dictionary is being reviewed"
  ► periodic updates in batches, 4x/year

[Figure: OED3 Revision Progress, number of entries by year (unrevised, revised, new)]

Dictionaries and computers
• 1960s - computers come into use: lexicographers write on paper, operators type the text into a database; Brown Corpus
• 1978, Longman Dictionary of Contemporary English
  ► first dictionary with a limited defining vocabulary, checked automatically
  ► special coding for NLP research
• 1980, COBUILD, University of Birmingham + Collins
  ► contemporary corpus (Bank of English)
  ► 1987, Collins COBUILD English Language Dictionary
  ► first dictionary based on corpus data
  ► new definition style - full sentence
  ► "If a person, animal, or other living thing is killed, something or someone causes them to die."
• 1990s - development of specialised dictionary writing systems
• 1987, Text Encoding Initiative

XML
• PB138 Modern Markup Languages
• Extensible Markup Language - markup (meta)language
• rules for a properly formatted document - easy machine processing and information exchange
• actual markup specified by the user (standards, custom)
• elements: a start tag, content, and an end tag
• elements without content may be shortened to a single empty-element tag
• attributes: name="value" pairs inside the start tag

Structure and content description
• DTD (Document Type Definition)
  ► list of elements and attributes, and their relations
  ► no content checking
• XML Schema (XSD, XML Schema Definition)
  ► description of XML document structure and content; the schema itself is an XML document
  ► elements, attributes, structure
  ► possibility to define custom content types (e.g. postal address)
  ► content checking (e.g. number range, regular expressions, allowed values)

Display
• XSLT - Extensible Stylesheet Language Transformations
• converting XML to another format
  ► other XML markup, plain text, HTML, LaTeX, PDF
• small templates for parts of the XML document, recursive processing of the document
• (functional programming language)

[Screenshot: formatted dictionary entry "lov" from SSJC]
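The XML and schema slides above can be illustrated with a small sketch. It parses one dictionary entry and runs a few content checks of the kind an XML Schema can express (allowed values, a regular expression). The element and attribute names (`entry`, `headword`, `pos`, `sense`, `n`), the entry text, and the helper `check_entry` are hypothetical and chosen only for illustration; they are not taken from TEI, LMF, or the lecture. The checks are written by hand in Python and are not a real XSD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical entry markup (illustrative names, not TEI or LMF).
# <pos .../> is an element without content, written in the shortened form.
ENTRY_XML = """
<entry id="lov-1">
  <headword>lov</headword>
  <pos value="noun"/>
  <sense n="1">
    <definition>hunting or catching of game or fish</definition>
    <example>lov lososů</example>
  </sense>
</entry>
"""

# Assumed list of allowed part-of-speech values (an "allowed values" check).
ALLOWED_POS = {"noun", "verb", "adjective", "adverb"}


def check_entry(entry):
    """Return a list of problems found in one <entry> element."""
    problems = []
    # Structure check: a non-empty <headword> must be present.
    if not (entry.findtext("headword") or "").strip():
        problems.append("missing headword")
    # Content check: the pos attribute must be one of the allowed values.
    pos = entry.find("pos")
    if pos is None or pos.get("value") not in ALLOWED_POS:
        problems.append("pos value not in allowed list")
    # Content check: sense numbers must match a regular expression.
    for sense in entry.findall("sense"):
        if not re.fullmatch(r"[1-9][0-9]*", sense.get("n", "")):
            problems.append("sense number is not a positive integer")
    return problems


entry = ET.fromstring(ENTRY_XML)
print(entry.get("id"), entry.findtext("headword"))
print(check_entry(entry))
```

In practice these constraints would more likely live in a DTD or XSD file and be enforced by a validating parser; the hand-written function only shows what "content checking" means for a dictionary entry.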
[Screenshot: dictionary writing system editing a French-English dictionary: entry list (sans, sans-cœur, Santa Claus, santé), entry structure tree (pronunciation, part-of-speech group, senses, examples), and formatted entry preview]
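The Display slide describes XSLT as small templates applied recursively to parts of the XML document. The sketch below imitates that idea in plain Python. It is not XSLT and uses no XSLT processor. The element names, the sample entry, and the names `TEMPLATES`, `render`, and `render_children` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical entry; element names are illustrative only.
ENTRY_XML = """
<entry>
  <headword>sans</headword>
  <pos>prep.</pos>
  <sense n="1">
    <definition>without</definition>
    <example>sans doute</example>
  </sense>
</entry>
"""


def render_children(element):
    """Default rule: render the element's text and children, in order."""
    parts = []
    if element.text and element.text.strip():
        parts.append(escape(element.text.strip()))
    for child in element:
        parts.append(render(child))
        if child.tail and child.tail.strip():
            parts.append(escape(child.tail.strip()))
    return " ".join(parts)


def render(element):
    """Apply the template registered for this tag, else the default rule."""
    template = TEMPLATES.get(element.tag, render_children)
    return template(element)


# One small template per element type, each recursing into its children.
TEMPLATES = {
    "entry": lambda e: f"<p>{render_children(e)}</p>",
    "headword": lambda e: f"<b>{render_children(e)}</b>",
    "pos": lambda e: f"<i>{render_children(e)}</i>",
    "sense": lambda e: f"{escape(e.get('n', ''))}. {render_children(e)}",
    "example": lambda e: f"<i>{render_children(e)}</i>",
}

html_output = render(ET.fromstring(ENTRY_XML))
print(html_output)
```

Elements without a registered template (here `definition`) fall through to the default rule, which is similar in spirit to how built-in template rules behave in XSLT. Replacing the `TEMPLATES` table would produce plain text or LaTeX from the same entry.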