Computational morphological analysis of Czech

Pavel Smerk
Natural Language Processing Centre
Faculty of Informatics, Masaryk University
http://nlp.fi.muni.cz/ma, aurora:/nlp/projekty/ajka
these slides: http://www.fi.muni.cz/~smerk/majka
2.10.2019

Morphological analysis

basic level of text processing
• (word forms are obvious in Czech, except for cases like gen., byl-li, oč/očs, etc.)
for a word form, the morphological analysis should return
• lemma, the dictionary form of the word
• possible grammatical meanings ("tags"): values of relevant grammatical categories like part of speech, case, number, person
• e.g., for the word form stroj we expect
  • stroj: noun, masculine inanimate, singular, nominative/accusative
  • strojit: verb, 2nd person singular, imperative mood, imperfective
+ synthesis, lemmatization (returns only the lemma), ...
• (it is not decomposition into morphemes, as one could guess)
the talk has three parts
• what information we want to capture and describe
• how we can organize that data
• how to implement the analysis itself

Tags

strings representing grammatical information
positional tagset: a tag consists of values only
• the corresponding category is determined by the position in the tag
• Prague system, 16 positions: part of speech, detailed PoS, gender, number, case, ..., negation, ...
• NNIS4-----A-----
  • noun, general, masc. inanim., singular, accusative, affirmative
• full description
attributive tagset: attribute-value pairs, the order does not matter
• Brno system: similar categories and values as in Prague
• e.g., attribute c for case with values 1 to 7
• k1gInSc4 = noun, masc. inanim., singular, accusative
• no detailed PoS and negation
• advantages: shorter, easier to read and extend, simpler regular expressions
• https://nlp.fi.muni.cz/raslan/2011/paper05.pdf
"heterogeneous" tagset (Bratislava)
• like the positional system, but without empty positions
• the first character denotes the part of speech, the following symbols correspond to attribute-value pairs
• the order is fixed, although each symbol is used in only one "sense"
• SSis4
  • noun, noun declension, masc. inanim., singular, accusative
• pros: the shortest tags
• cons: hard to extend, because the set of ASCII symbols is limited
  • ⇒ two-character values like :r for proper names
• https://korpus.sk/morpho_en.html
different type of language, different solution: BNC tagset
• a fixed set of a few dozen tags, for example:
  • AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
  • AJC Comparative adjective (e.g. better, older)
  • AJS Superlative adjective (e.g. best, oldest)
  • PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
• http://www.natcorp.ox.ac.uk/docs/c5spec.html

What we want to describe

seems obvious at first sight, as it is taught at grammar school
but there are both practical and theoretical (linguistic) disputes
choices in lemmatization
• take word derivation into account?
  • otcova → otcův/otec, učený → učený/učit, učení → učení/učit
  • nejstaršího → starý/nejstarší (searching: [věk]... člověk)
  • (+ "starší paní" can be younger than "stará paní")
  • nebral → brát/nebrat (úplatky); nemalý → malý/nemalý
  • bachelor thesis from VŠMIE: in online marketing, the singular and plural of nouns are treated as different keywords
• what about equivalents (doublets)?
  • mysli → myslet/myslit
  • kapitalismem → kapitalismus/kapitalizmus
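The two Czech tag formats from the Tags slides can be decoded mechanically. The sketch below is only an illustration: the function names are made up, the positional decoder names just the positions the slide lists (the index of negation is read off the example tag), and the attributive decoder assumes every attribute and every value is exactly one character, which holds for the examples in this talk but is not guaranteed for the full Brno tagset.

```python
# 0-based positions of the categories named on the slide for the 16-position Prague tag.
# Positions not named in the talk are deliberately left out.
PRAGUE_POSITIONS = {
    0: "pos",
    1: "detailed_pos",
    2: "gender",
    3: "number",
    4: "case",
    10: "negation",
}


def parse_positional(tag: str) -> dict[str, str]:
    """Decode a positional tag: the category is determined by the position."""
    return {
        name: tag[i]
        for i, name in PRAGUE_POSITIONS.items()
        if i < len(tag) and tag[i] != "-"  # "-" marks an empty position
    }


def parse_attributive(tag: str) -> dict[str, str]:
    """Decode an attributive tag built from attribute+value character pairs."""
    if len(tag) % 2:
        raise ValueError(f"odd-length tag: {tag!r}")
    return {tag[i]: tag[i + 1] for i in range(0, len(tag), 2)}


print(parse_positional("NNIS4-----A-----"))
print(parse_attributive("k1gInSc4"))
```

Because the attributive form carries its own attribute names, a query such as "any accusative" becomes a substring test for c4, which is one reading of the "simpler regular expressions" advantage listed for the Brno system.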
What we want to describe (continued)

selection of grammatical categories and their values
• parts of speech: abbreviations, punctuation, numbers, contractions (cos, oč, kdyby)
• categories: subdivision of pronouns, numerals, adverbs; case for prepositions; animacy of koho/čeho
• values: dual number (pes se 4 nohama), subdivision of pronouns
a bigger problem is to set the possible tags for a word form
• e.g., which parts of speech can be attributed to a, ani, ať, až, ...
the most problematic part is to set rules for analysing a word form in a particular sentence context
• if a word form can have tags A or B, it should be clear which one to select in a particular context (inter-annotator agreement)
• it is hard to teach a computer to decide between A and B if even native speakers do not agree

Morphological analyser ajka

"original" solution (Osolsobě 1996, Sedláček 1999+2005)
organization of data
• (which forms belong to the same lemma is known a priori)
• word forms = "stem" (longest common left substring) + "endings"
• lemmata with the same ending set belong to the same paradigm
• kluk is like vlk, but not like pes or slon

  nom. sg.   vl-k      p-es      slon-0
  gen. sg.   vl-ka     p-sa      slon-a
  dat. sg.   vl-ku     p-su      slon-u
  dat. sg.   vl-kovi   p-sovi    slon-ovi
  nom. pl.   vl-ci     p-si      slon-i

technical solution: "intersegment" between the stem and the ending
• vl-k-0, p-es-0, slon-0-0; ... vl-c-i, p-s-i, slon-0-i; ...
• smaller data, the principle is the same

Dictionary and paradigm files format

dictionary file
• format lemma:paradigm, ! has negation, % reflexiva tantum, notes

  hanbit:barvit!% 1793.1,167.1
  zelený:nový!1148.1
  osel:orel 1180.1

paradigm file: paradigm definition
• paradigm lemma + intersegment + list of ending sets

  +barvit NEWES717, NEWES744, konc44 NEWES710 NEWES705, NEWES778 NEWES757 NEWES759

paradigm file: ending set definition
• set of ending + tag pairs
• (the names are arbitrary, generated)

  =NEWES717 {t, k5aImF}
  =NEWES705 {y, k5aImAgFnP} {i, k5aImAgMnP} {a, k5aImAgFnS} ...

interpretation
• get the stem by deleting the first intersegment and the first ending from the end of the lemma, then add intersegments and endings
• hanbit = hanb + -i + -t
  hanb-i-t k5aImF, ..., hanb-il-i k5aImAgMnP, ...

Principle of the analysis

analysed word form $w_1 \dots w_n = S + I + E$
any part, stem $S$, intersegment $I$, or ending $E$, can be empty
• e.g. slon-0-0 or 0-člověk-0, 0-lid-e
⇒ possible stems: $\varepsilon, w_1, \dots, w_1 \dots w_n$
for each possible stem $S = w_1 \dots w_i$ found in the list of stems, the analyser tries to find candidates for $w_{i+1} \dots w_n = I + E$ in the paradigm of the stem
the result is the set of tags corresponding to the found triplets $S + I + E$
in fact it is a bit more complicated, as the analysis also works with possible prefixes nej and ne and postfixes like s in Byls tam?
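The S + I + E principle can be sketched in a few lines. This is a toy illustration, not ajka's code: the data structures and names are hypothetical, only two ending sets from the slides are included, and the pairing of the intersegments "i" and "il" with those ending sets is an assumption based on the hanb-i-t and hanb-il-i examples. Prefixes (nej, ne) and postfixes are ignored.

```python
# Toy data modelled on the slides; the structure is an assumption for illustration.
STEMS = {"hanb": ("hanbit", "barvit")}  # stem -> (lemma, paradigm)

ENDING_SETS = {  # ending set name -> {ending: [tags]}
    "NEWES717": {"t": ["k5aImF"]},
    "NEWES705": {
        "y": ["k5aImAgFnP"],
        "i": ["k5aImAgMnP"],
        "a": ["k5aImAgFnS"],
    },
}

# paradigm -> {intersegment: [ending set names]}; this pairing is assumed
PARADIGMS = {"barvit": {"i": ["NEWES717"], "il": ["NEWES705"]}}


def analyse(word: str) -> list[tuple[str, str]]:
    """Return (lemma, tag) pairs for every split word = S + I + E found in the data."""
    results = []
    for cut in range(len(word) + 1):  # every prefix, including the empty one
        stem, rest = word[:cut], word[cut:]
        if stem not in STEMS:
            continue
        lemma, paradigm = STEMS[stem]
        for intersegment, set_names in PARADIGMS[paradigm].items():
            if not rest.startswith(intersegment):
                continue
            ending = rest[len(intersegment):]
            for name in set_names:
                for tag in ENDING_SETS[name].get(ending, []):
                    results.append((lemma, tag))
    return results


print(analyse("hanbit"))   # split hanb-i-t
print(analyse("hanbili"))  # split hanb-il-i
```

The outer loop enumerates the candidate stems, and the inner loops check whether the remainder is an intersegment plus an ending licensed by that stem's paradigm.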
New data format

Disadvantages of the old data format

the basic principles are the same in both Brno and Prague
• dictionary of stems + set of paradigms, i.e., endings with tags
• stems belong to paradigms; by joining the stem with its paradigm endings one obtains word forms with tags
• both stems and endings are strings, which are only joined together
the main disadvantage follows from that: redundancy
• Luděk/Luďka, Staněk/Staňka, vrah/vraha, medvídek/medvídka, etc., are declined in a very similar way but need separate paradigms (or exceptions in the Prague system)
in the long term, redundancy leads to inconsistency
• e.g.: adding a colloquial gen. sg. -a (muža) for masc. anim.
  • 217 paradigms are affected, so the change needs to be automated: Gsg -e → -a
  • but ca. 10 paradigms had -é instead of the expected -e
  • strašpytel and neumětel already had -a
  • it is hard or even impossible to check the results

New data format

• dictionary and paradigm files remain
• the goal is to separate the regular and the irregular
• dictionary: what is specific for particular lemmata
  • what a language user has to remember
• paradigms + program: "language system"
  • endings and their regular behaviour, phonological rules
• stems are in the dictionary: slon:pán
• endings form the paradigms:

  pán k1gM
  nSc1 0
  nSc2 a
  nSc3 u, ovi
  ...

• the stems are joined with the endings: slon-0, slon-a, ...
• the corresponding tags are concatenations of the paradigm part and the ending-specific part: k1gMnSc1, k1gMnSc2, ...
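Joining dictionary stems with paradigm endings, and concatenating the tag parts, might look like the following minimal sketch. It uses only the slon:pán fragment shown above; the dictionary layout and the function name are hypothetical, and the output is still the intermediate string form (with - and 0), not the final word form.

```python
# Dictionary: stem -> paradigm name (what is specific to a lemma).
DICTIONARY = {"slon": "pán"}

# Paradigm: common tag part + {ending-specific tag part: [endings]}.
# Only the three cases shown on the slide are included.
PARADIGMS = {
    "pán": ("k1gM", {"nSc1": ["0"], "nSc2": ["a"], "nSc3": ["u", "ovi"]}),
}


def generate(stem: str) -> list[tuple[str, str]]:
    """Return (stem-ending, tag) pairs; the tag concatenates both tag parts."""
    common_tag, endings = PARADIGMS[DICTIONARY[stem]]
    return [
        (f"{stem}-{ending}", common_tag + tag_part)
        for tag_part, ending_list in endings.items()
        for ending in ending_list
    ]


for form, tag in generate("slon"):
    print(form, tag)
```

Keeping the common tag part on the paradigm and the case/number part on each ending is what lets one paradigm entry serve every stem assigned to it.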
New data format (continued)

some simple rules transform the strings (slon-0) into word forms
• obviously, we have to remove all - and 0
• ňe → ně: tuleň-e → tuleňe → tuleně
  • or ň-e → ně: tuleň-e → tulen-ě → tuleně
  • the first intermediate form corresponds to what is read
• Ábel × ďábel: Ábel × ďáb.el: .eC-0 → eC-0, .eC-V → C-V
  • (the phonological context is the same ⇒ it is dictionary information)
• vlk-i → vlc-i (and also pán-i → páň-i → páňi → páni)
the use of endings can be restricted according to the end of the stem
• e.g. nPc6 ech, ích/[ghk]|ch (in a paradigm)
even these few improvements allow us to unify the description of many paradigms that were distinct in the old format
• Luď.ek-0 → Luďek-0 → Luďek → Luděk
• pejs.ek-ích → pejsk-ích → pejsc-ích → pejscích

some other enhancements
paradigm inheritance: soudce:muž nSc1 e nSc5 e
• by default, the endings for the given tags are replaced
• +nSc5 e would add the ending
• restricted inheritance: despota:pán_nP + singular endings
partial paradigms for some specific endings: -ové k1gM nPc1 ové
using multiple paradigms: filozof:pán,-ové dřevokaz:pán,+muž

New data format: technical details (originally in Czech)

further
• colloquial forms: Npl (and Vpl) ?učitelové, but *pokrytcé
• in general: 1) whether -é is possible or not; 2) which of the endings -i and -ové are standard
• filozof:pán,<-ové; občan:pán,<-é; akrobat:pán,<-i,+-é
• (without < I would have to define the substandard endings in the -é paradigms)
• multiple stems, irregular forms (i.e., the dictionary)
  přítel:muž,<-é Marceli\ Marcelu
• despot:žena_nS,-ovi,pán_nP gM
  gigol:město_nS,+-ovi,pán_nP gM (ě/!gM)
• (tag composition, implicit tag, implicit paradigm, ...)
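The string rewriting rules listed earlier for the new format (the fleeting .e, k → c before i, removing - and 0, and the ňe → ně spelling) can be chained as ordered substitutions. The sketch below is a guess at one workable ordering, not the actual rule engine: the function name, the consonant and vowel classes, and the extension of the ňe rule to ď and ť are assumptions.

```python
import re

VOWEL = "[aáeéěiíoóuúůyý]"
CONS = r"[^aáeéěiíoóuúůyý.0-]"  # anything that is not a vowel or a marker


def surface(form: str) -> str:
    """Turn an intermediate string such as 'pejs.ek-ích' into a word form."""
    s = form
    # .eC-0 -> eC-0: the fleeting e is kept before the zero ending
    s = re.sub(rf"\.e({CONS})-0", r"e\1-0", s)
    # .eC-V -> C-V: the fleeting e is dropped before a vowel-initial ending
    s = re.sub(rf"\.e({CONS})-(?={VOWEL})", r"\1-", s)
    # k-i -> c-i: softening of stem-final k before i/í
    s = re.sub(r"k-(?=[ií])", "c-", s)
    # remove the boundary marker and the zero ending
    s = s.replace("-", "").replace("0", "")
    # spelling: ďe/ťe/ňe are written dě/tě/ně, and ďi/ťi/ňi as di/ti/ni
    for soft, plain in (("ď", "d"), ("ť", "t"), ("ň", "n")):
        s = s.replace(soft + "e", plain + "ě").replace(soft + "i", plain + "i")
    return s


for example in ("slon-0", "tuleň-e", "vlk-i", "Luď.ek-0", "pejs.ek-ích"):
    print(example, "->", surface(example))
```

The order matters: the fleeting-e rules must run while the - and 0 markers are still present, and the spelling step runs last because it applies to what is actually pronounced.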
New data format: from paradigms to features

• native speakers do not remember paradigms for all words but decline words according to some semantic, structural, or phonological features
  • for proper names -ové is preferred over -i
  • words derived with the suffix -an are pán,<-é
  • masculine animates that end with c/have "hard" declension
• implicit rules: typical, regular behaviour controlled by
  • phonological features of the stem end or
  • semantic features described by a tag in the dictionary

  $k1gM \Ko město_nS,+-ovi,pán_nP,muž_nP/$M|i,-ové s/qJO muž,90%

• we need to know only (a part of) the tag
• redundancy is reduced
• the description is more linguistically acceptable

New morphological analyser majka

ajka was quite fast, but too complex ⇒ unmaintainable
we employed an approach from Jan Daciuk's dissertation
• the analysis is only searching for the word form in the WLT list

  ježek   ježek:k1gMnSc1
  ježka   ježek:k1gMnSc2
  ježka   ježek:k1gMnSc4
  krtek   krtek:k1gMnSc1
  krtka   krtek:k1gMnSc2
  krtka   krtek:k1gMnSc4

• in fact, the data is a list of query:response pairs in the following format

  ježek:A:k1gMnSc1
  ježka:Cek:k1gMnSc2
  ježka:Cek:k1gMnSc4
  krtek:A:k1gMnSc1
  krtka:Cek:k1gMnSc2
  krtka:Cek:k1gMnSc4

the list is a finite language, so there is a DAFSA for it
• encoding the lemma allows the necessary minimization
• Daciuk offers incremental construction that preserves minimality
(NB: this part is independent: WLT can be generated from the old format, and data for ajka can be generated from the new format)

[figure: non-minimized deterministic automaton for the example data]
[figure: minimized deterministic automaton for the same data]

"analysis" is just a fast and simple pass through the FSA
• deterministic for the "query" + all the "responses"

data for lemmatization, generation, etc. are built in a similar way:
lemmatization: krtek:A, krtka:Cek
generation: krtek:A:k1gMnSc1, krtek:Cka:k1gMnSc2
• or from lemma and tag: krtek:k1gMnSc2:Cka
"deep" structure: krtek:C.ek-0, mužova: D=°/0ov-a
• or after application of some rules: krtek:Cek-0, krtka:Ck-a
prefixes: nemalý:CA:k2*, malý:Ane:A:k2*/malý:ACneA:k2*

statistical information about (some) dictionaries

  dictionary   lines         source MB   dictionary MB   bytes/line
  w            13,609,590    186         3.3             0.240
  w → l        14,101,767    240         4.0             0.287
  w → l+t      80,303,929    2,478       4.4             0.054
  w → w        957,464,060   19,993      6.1             0.006

comparison with morphological analyser ajka

                  data size (MB)     time in seconds
                  ajka     majka     ajka       majka    ratio
  analysis                 4.4       18.22      2.88     6.3×
  lemmatization   3.1      4.0       16.76      1.57     10.7×
  word forms               6.1       55.33      8.42     6.6×
  diacritics               3.3       8698.80    1.61     5403×

analysis is ~4.6× faster than the Prague analyser Morfo
majka is used in, e.g., Seznam.cz or IS MU projects
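The query:response entries used by majka (for example ježka:Cek:k1gMnSc2) appear to encode the lemma relative to the word form: a letter giving how many characters to strip from the end (A = 0, B = 1, C = 2, ...), followed by the suffix to append. That reading is inferred from the examples in the slides, not from majka's source, and all function names below are hypothetical. The linear scan in analyse only stands in for the automaton traversal; it is not how a DAFSA lookup works.

```python
import string


def encode_lemma(word: str, lemma: str) -> str:
    """Encode the lemma as <strip-count letter><suffix> relative to the word form."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1  # length of the common prefix
    strip = len(word) - i
    # assumption: at most 25 characters are ever stripped (A..Z)
    return string.ascii_uppercase[strip] + lemma[i:]


def decode_lemma(word: str, code: str) -> str:
    """Invert encode_lemma: strip characters from the word, then append the suffix."""
    strip = ord(code[0]) - ord("A")
    return word[: len(word) - strip] + code[1:]


def analyse(word: str, entries: list[str]) -> list[tuple[str, str]]:
    """Look the word form up in a list of 'word:code:tag' entries."""
    results = []
    for entry in entries:
        form, code, tag = entry.split(":")
        if form == word:
            results.append((decode_lemma(form, code), tag))
    return results


ENTRIES = [
    "ježek:A:k1gMnSc1", "ježka:Cek:k1gMnSc2", "ježka:Cek:k1gMnSc4",
    "krtek:A:k1gMnSc1", "krtka:Cek:k1gMnSc2", "krtka:Cek:k1gMnSc4",
]

print(encode_lemma("ježka", "ježek"))
print(analyse("krtka", ENTRIES))
```

With this encoding ježka and krtka share the response Cek:k1gMnSc2, so their paths can merge into common suffixes in the automaton, which is presumably what "encoding the lemma allows the necessary minimization" refers to.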