Why? How? Sketch grammar SET parser Synt parser Conclusions Links

Syntactic analysis of natural languages

Vojtěch Kovář
NLP Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno
xkovar3@fi.muni.cz

PA153 Natural Language Processing

Vojtěch Kovář FI MU Brno Syntactic analysis of natural languages

Syntactic analysis

  What?
    - reveal the structure of the sentence
    - relationships among words and phrases
  Why?
    - basis for more informed language analysis
    - more than keywords
    - semantic and logical analysis, question answering, ...
    - applications can benefit from syntactic information
    - red brick house vs. red house brick vs. brick house red
  Origins
    - Noam Chomsky: Syntactic Structures (1957)
    - theory of formal languages

Automatic syntactic analysis of natural languages

  Preprocessing
    - sentence boundary detection
    - word segmentation
    - morphological analysis and disambiguation
    - (named entity and MWE recognition, lexical semantics, ...)
    - compatibility issues
  Encoding
    - phrase structure formalism
    - dependency formalism
    - partial analysis
    - advanced: CCG, LFG, HPSG, TAG

Phrase structure formalism – example

  [Figure: phrase-structure tree over "I saw a man with a telescope ."]

Dependency formalism – example

  Notation: token/label.
  I/subject saw/predicate a/det man/object with/pp-attached a/det telescope/prep-object . [root]

Dependency vs. phrase-structure

  Non-projectivity
    - disconnected phrases
    - not natural in the phrase structure notation
    - 20% of Czech sentences are reported to contain a non-projective dependency
  Phrase structure: more fine-grained analysis
    - (new (queen of beauty))
    - (new generation)(of fighters)
  Coordinations and other "flat" phenomena
    - not natural in the dependency notation
    - problem for dependency analysis

Non-projectivity – example

  Malou/attr měl/predicate chaloupku/object . [root]
  (roughly: "He had a small cottage"; the attribute Malou depends on chaloupku, so its edge crosses the predicate měl)

Non-projectivity in phrase structure formalism

  [Figures: phrase-structure trees over "Malou měl chaloupku ." and over the reordered "měl Malou chaloupku ."]

Phrase structure expressivity

  New/modifier queen of/pp-attached beauty/prep-object [root]
  New/modifier generation of/pp-attached fighters/prep-object [root]
  (both phrases receive the same dependency labels)
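The non-projective example above (Malou měl chaloupku) can be checked mechanically. The following is a minimal Python sketch of a crossing-arcs test, not taken from the slides. It assumes heads are given as 1-based token positions with 0 standing for the artificial [root], and it leaves out the final full stop. The function name `is_projective` and the head assignments are illustrative assumptions based on the labels shown (attr, predicate, object).

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the position of the head of token i+1 (tokens are numbered
    from 1); 0 denotes the artificial root placed before the sentence.
    """
    # Store every arc as an (left end, right end) pair of positions.
    arcs = [(min(head, dep), max(head, dep))
            for dep, head in enumerate(heads, start=1)]
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Two arcs cross when exactly one end of the second arc
            # lies strictly inside the first arc.
            if a_left < b_left < a_right < b_right:
                return False
    return True


# Malou(1) -> chaloupku(3), měl(2) -> root(0), chaloupku(3) -> měl(2)
malou_mel_chaloupku = [3, 0, 2]
# měl(1) -> root(0), Malou(2) -> chaloupku(3), chaloupku(3) -> měl(1)
mel_malou_chaloupku = [0, 3, 1]

print(is_projective(malou_mel_chaloupku))  # False: Malou-chaloupku crosses root-měl
print(is_projective(mel_malou_chaloupku))  # True
```

Under these assumptions the original word order is reported as non-projective and the reordered one as projective, which matches the contrast the slides draw.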
Phrase structure expressivity

  [Figures: phrase-structure trees over "New queen of beauty" and over "New generation of fighters"]

Coordinations – dependency structure

  Notation: token/label.
  velmi/modifier těžký/coord a/attr rozměrný/coord fragment [root]
  (roughly: "a very heavy and bulky fragment")

Coordinations – phrase structure

  [Figure: phrase-structure tree over "velmi těžký a rozměrný fragment"]

CCG: Combinatory Categorial Grammar

  [Figure]

LFG: Lexical Functional Grammar

  [Figure]

HPSG: Head-driven Phrase Structure Grammar

  [Figure]

TAG: Tree Adjoining Grammar

  [Figure]

Parsing methods

  Rule-based
    - RASP, synt, SET, Žabokrtský, Dis/VaDis
    - manually created grammar
    - CFG (CKY parser, chart parser), dependency grammar, Prolog DCG, ...
  Statistical
    - MaltParser, MST Parser, Stanford parser, ...
    - grammars created from annotated data by statistical methods
    - directly predicting the tree shape
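To make the CKY parser mentioned under rule-based methods concrete, here is a minimal Python sketch that counts the parses of a sentence. The toy grammar in Chomsky normal form, the lexicon and the function name `count_parses` are assumptions made for illustration; none of them comes from the slides or from any of the listed parsers.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an illustrative assumption).
# Keys are (left child, right child); values are the possible parents.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # PP attached to the verb phrase
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],   # PP attached to the noun phrase
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "I": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(tokens, start="S"):
    """Count the derivations of `tokens` from `start` with the CKY algorithm."""
    n = len(tokens)
    # chart[(i, j)][category] = number of ways to derive tokens[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        for category in LEXICON.get(token, []):
            chart[(i, i + 1)][category] += 1
    # Fill longer spans from shorter ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for left, left_count in list(chart[(i, k)].items()):
                    for right, right_count in list(chart[(k, j)].items()):
                        for parent in BINARY_RULES.get((left, right), []):
                            chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)][start]


print(count_parses("I saw a man with a telescope".split()))  # 2
```

With this grammar the sentence gets two analyses, one with the prepositional phrase attached to the verb phrase and one with it attached to the noun phrase.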
Parsing evaluation

  Treebanks
    - corpora manually annotated for syntactic structure
    - Penn Treebank, Prague Dependency Treebank (PDT)
  Tree similarity metrics
    - PARSEVAL: precision, recall, F-score over phrases
    - Leaf-ancestor assessment: edit distance over root-leaf paths
    - dependency precision, labelled or unlabelled
    - best results: 85–93 percent

Problems

  Central problem
    - massive ambiguity
    - “I saw a man with a telescope”
    - “A plane fell into a field next to a forest.”
    - problems with evaluation
  Is the task well-defined?
    - inter-annotator agreement is rarely reported
    - in the case of PDT, around 90%
    - Sampson showed that above 95% is unreachable
    - → current parsers are very good
    - however, rather low usage in applications

Problems (II)

  Low usage
    - compared to e.g. morphological tagging
    - no use in Google, Seznam, Facebook, ...
    - the Wikipedia page for information extraction does not even mention parsing or syntax
    - neither does a Czech question answering system (Konopík, Rohlík)
    - ACL Anthology: 7,232 matches for the word “parser”, 133 matches for using parsers (Jakubíček)
  Are the results useless?

Problems (III)

  Application-sparse output
    - trees do not provide all the information needed
    - but at the same time they do contain noise
  Application-free evaluation
    - tree similarity metrics do not correlate well with the accuracy of the end applications
    - as illustrated by Miyao, Google research, our collocation extraction research
  Technical aspects
    - parsers hard to run, output not readable
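As an illustration of the PARSEVAL metric named on the Parsing evaluation slide, here is a simplified Python sketch of precision, recall and F-score over labelled phrases. It assumes each phrase is a (label, start, end) triple and treats each tree as a set of such triples, which ignores details of the full metric such as duplicate brackets. The function name `parseval` and the two bracketings are illustrative assumptions.

```python
def parseval(gold, predicted):
    """Return (precision, recall, F-score) over labelled phrase spans.

    Both arguments are iterables of (label, start, end) triples.
    Simplification: trees are compared as sets of phrases.
    """
    gold_set, predicted_set = set(gold), set(predicted)
    matched = len(gold_set & predicted_set)
    precision = matched / len(predicted_set) if predicted_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# "I saw a man with a telescope": token positions 0..7.
# Assumed gold tree: the PP attaches to the verb phrase.
gold = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("VP", 1, 4),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
# Assumed parser output: the PP attaches to the noun phrase.
predicted = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7),
             ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]

print(parseval(gold, predicted))
```

In this example a single wrong attachment decision changes one phrase out of seven, so the similarity score stays high even though the two trees encode different meanings.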
Treebank problems

  Apart from evaluation problems, treebanks are
    - expensive
    - old
    - domain-specific
    - unambiguous
  Treebank formalisms enforce
    - annotation manuals containing hundreds of pages
    - senseless annotations and garbage

[Treebank annotation examples; notation: token/label]

  ,/AuxX tel/Atr ./AuxG :/AuxG (/AuxG 0649/Atr )/AuxG 64/Atr 13/Atr ,/AuxX FAX/Atr :/AuxG (/AuxG 0649/Atr )/AuxG 64/Atr 11/Atr

  ČSN/Sb 46/Atr 3130/Atr -/AuxG 75/Atr a/Coord její/Atr změny/Sb 7/Atr //AuxG 1983/Atr ,/AuxX 1/Atr //AuxG 1984/Atr ,/Coord 8/Atr //AuxG 1989/Atr

  Většinu/Obj těchto/Atr přístrojů/Atr lze/Pred také/AuxZ používat/Sb nejen/AuxY jako/AuxY fax/Atv ,/AuxX ale/Coord současně/ExD i jako/AuxC výkonnou/Atr kopírku/ExD ,
  (roughly: "Most of these devices can also be used not only as a fax but at the same time as a powerful copier,")

  (...)/Atr (...)/AuxX pokud/AuxC je/AuxV nakupován/Adv v/AuxP rolích/Adv je/Pred pak/Adv uváděn/Pnom v/AuxP prospektech/Adv ,
  (roughly: "if it is purchased in rolls, it is then listed in brochures,")

Proposed solution: You aren’t gonna need it

  Rapid application development
    - “worse is better”
    - “keep it simple stupid” (KISS)
    - “you aren’t gonna need it” (YAGNI)
    - completeness, consistency, correctness, simplicity
  Implications
    - start from applications
    - strong emphasis on interaction with applications
    - do not develop/implement theory that is not immediately needed
    - simple, imperfect parsers, possibly task-specific
    - rule-based first, until we find what we actually need
    - extrinsic evaluations
Sketch grammar: A shallow approach to syntax

  Designed for collocation extraction
    - Kilgarriff and Rychlý, The Sketch Engine
    - based on Corpus Query Language
    - results of queries scored statistically
    - → pragmatic partial syntactic analysis
  Extensions
    - multi-word sketches
    - bilingual word sketches
    - terminology extraction
    - bilingual terminology extraction

Word Sketch – original

  [Figure]

Sketch grammar example

  *DUAL
  =subject/subject_of
  2:[tag="N.*"] [tag="RB.?"]{0,3} [lemma="be"]? [tag="RB.?"]{0,2} 1:["V.[^N]?"]

Terminology extraction

  [Figure]

Sketch grammar for terminology extraction

  =terms
  *COLLOC "%(2.lc)_%(1.lc)"
  2:[tag=="NN" | tag=="JJ" | tag=="VVG"] 1:[tag=="NN"]
  *COLLOC "%(3.lc)_%(2.lc)_%(1.lc)"
  3:[tag=="NN" | tag=="JJ" | tag=="VVG"] 2:[tag=="NN" | tag=="JJ" | tag=="VVG"] 1:[tag=="NN"]

SET – a light-weight parsing system

  Hybrid trees
    - combination of dependency and phrase structure formalisms
    - readability, natural analysis
  Pattern matching grammar
    - similar to CQL
    - manually created and ranked rules
    - rules → matches → sorting → best tree

Hybrid tree

  [Figure: hybrid tree over "I saw a man and a woman with a telescope ."]
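The two terminology-extraction rules shown earlier (a noun preceded by one or two tokens tagged NN, JJ or VVG) can be imitated outside the Sketch Engine. The Python sketch below only reproduces the pattern-matching step over an already tagged sentence; it is not the Sketch Engine implementation and it does not do the statistical scoring. The function name `extract_terms` and the tagging of the example phrase are assumptions made for illustration.

```python
# Tags allowed in the non-final positions of the two rules.
TERM_TAGS = {"NN", "JJ", "VVG"}


def extract_terms(tagged):
    """Collect two- and three-token term candidates from (word, tag) pairs.

    A window matches when its last token is tagged NN and every earlier
    token carries a tag from TERM_TAGS. Matches are lowercased and joined
    with "_", mirroring the "%(2.lc)_%(1.lc)" output format of the rules.
    """
    terms = []
    for i in range(len(tagged)):
        for size in (2, 3):
            window = tagged[i:i + size]
            if len(window) < size:
                continue  # not enough tokens left for this rule
            *modifiers, (_, last_tag) = window
            if last_tag == "NN" and all(tag in TERM_TAGS for _, tag in modifiers):
                terms.append("_".join(word.lower() for word, _ in window))
    return terms


# Hypothetical tagging of the phrase from the introductory slide.
sentence = [("Red", "JJ"), ("brick", "NN"), ("house", "NN")]
print(extract_terms(sentence))  # ['red_brick', 'red_brick_house', 'brick_house']
```

Every match is kept as a candidate here; in the approach described on the slides, the statistical scoring of query results is what separates useful terms from noise.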
SET rule example

  TMPL: (tag k5) ... $AND ... (tag k5) MARK 0 2 4 PROB 500 HEAD 2
  $AND(word): , a ani nebo

Synt – a traditional CFG+ parser

  - CFG backbone + contextual actions
  - manually created CFG grammars for Czech, Slovak, English
  - statistical ranking of rules
  - chart parser + extensions

Conclusions

  There are many ways to approach syntactic analysis
    - none of them became dominant in practice (yet?)
  Basic formalisms
    - dependencies
    - phrase structure
  Manual as well as statistical approaches

Links

  www.diotavelli.net/people/void/demos/cky.html
  en.wikipedia.org/wiki/Definite_clause_grammar
  en.wikipedia.org/wiki/Combinatory_categorial_grammar
  en.wikipedia.org/wiki/Head-driven_phrase_structure_grammar
  nlp.fi.muni.cz/projekty/wwwsynt
  nlp.fi.muni.cz/projekty/wwwsynt/query.cgi
  nlp.fi.muni.cz/trac/set
  nlp.fi.muni.cz/projekty/set/wwwset.cgi/first_page
  ufal.mff.cuni.cz/pdt2.0/index-cz.html