Introduction to Natural Language Processing (600.465)
Linguistic Essentials: Phonology and Morphology
Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic
10/2/2000 JHU CS 600.465/Intro to NLP/Jan Hajic

The Description of Language
• Grammar
  - set of rules which describe what is allowable in a language
• Classic Grammars (Quirk et al.)
  - meant for humans who know the language
  - definitions and rules are mainly supported by examples
  - no (or almost no) formal description tools; cannot be programmed
• Explicit Grammar (CFG, LFG, GPSG, HPSG, Dependency Grammars, Link Grammars, ...)
  - formal description
  - can be programmed & tested on data (texts)

Levels of (Formal) Description
• 6 basic levels (more or less explicitly present in most theories):
  - and beyond (pragmatics/logic/...)
  - meaning (semantics)
  - (surface) syntax
  - morphology
  - phonology
  - phonetics/orthography
• Each level has an input and output representation
  - output from one level is the input to the next (upper) level
  - sometimes levels might be skipped (merged) or split

Phonetics/Orthography
• Input:
  - acoustic signal (phonetics) / text (orthography)
• Output:
  - phonetic alphabet (phonetics) / text (orthography)
• Deals with:
  - Phonetics:
    • consonant & vowel (& others) formation in the vocal tract
    • classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles in the vocal tract
    • intonation
  - Orthography: normalization, punctuation, etc.

Phonology
• Input:
  - sequence of phones/sounds (in a phonetic alphabet); or "normalized" text (sequence of (surface) letters in one language's alphabet) [NB: phones vs.
    phonemes]
• Output:
  - sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
• Deals with:
  - relation between sounds and phonemes (units which might have some function on the upper level)
  - e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

Morphology
• Input:
  - sequence of phonemes (~ (lexical) letters)
• Output:
  - sequence of pairs (lemma, (morphological) tag)
• Deals with:
  - composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  - e.g. quotations ~ quote/V + -ation (der. V->N) + NNS

(Surface) Syntax
• Input:
  - sequence of pairs (lemma, (morphological) tag)
• Output:
  - sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
• Deals with:
  - the relation between lemmas & morphological categories and the sentence structure
  - uses syntactic categories such as Subject, Verb, Object, ...
  - e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

Meaning (semantics)
• Input:
  - sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
• Output:
  - sentence structure (tree) with annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
• Deals with:
  - relation between categories such as "Subject", "Object" and (deep) categories such as "Agent", "Effect"; adds other categories
  - e.g.
    ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

...and Beyond
• Input:
  - sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
• Output:
  - logical form, which can be evaluated (true/false)
• Deals with:
  - assignment of objects from the real world to the nodes of the sentence structure
  - e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...]) ...

Phonology
• (Surface ↔ Lexical) Correspondence
• "symbol-based" (no complex structures)
• Ex.: (stem-final change)
  - lexical: b a b y + s (+ denotes start of ending)
  - surface: b a b i e s (phonetic-related: bebTOs^)
• Arabic: (interfixing, inside-stem doubling) (lit. 'read')
  - lexical: kTb+UU+CVCCVC (CVCC...: vowel/consonant pattern)
  - surface: kuttub

Phonology Examples
• German (umlaut) (Satz ~ sentence)
  - lexical: s A t z + e (A denotes "umlautable" a)
  - surface: s ä t z e (phonetic: zaecS, vs. zac)
• Turkish (vowel harmony)
  - lexical: e v + l A r ('houses'); b a s + l A r ('heads')
  - surface: e v l e r; b a s l a r
• Czech (e-insertion & palatalization)
  - lexical: m a t E K + 0 ('mothers/gen.'); m a t E K + e ('mother/dat.')
  - surface: m a t e k; m a t c e

Morphology: Morphemes & Order
• Handles what is an isolated form in written text
• Grouping of phonemes into morphemes
  - sequence deliverables -> deliver, able and s (3 units)
  - could as well be some "ID" numbers:
    • e.g.
      deliver ~ 23987, s ~ 12, able ~ 3456
• Morpheme Combination
  - certain combinations/sequencing possible, others not:
    • deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
    • typically fixed (in any given language)

Morphology: From Morphemes to Lemmas & Categories
• Lemma: lexical unit, "pointer" to lexicon
  - might as well be a number, but typically is represented as the "base form", or "dictionary headword"
    • possibly indexed when ambiguous/polysemous:
      - state1 (verb), state2 (state-of-the-art), state3 (government)
  - from one or more morphemes ("root", "stem", "root+derivation", ...)
• Categories: non-lexical
  - small number of possible values (< 100, often < 5-10)

Morphology Level: The Mapping
• Formally: $A^+ \to 2^{(L, C_1, C_2, \ldots, C_n)}$
  - $A$ is the alphabet of phonemes ($A^+$ denotes any nonempty sequence of phonemes)
  - $L$ is the set of possible lemmas, uniquely identified
  - $C_i$ are morphological categories, such as:
    • grammatical number, gender, case
    • person, tense, negation, degree of comparison, voice, aspect, ...
    • tone, politeness, ...
    • part of speech (not quite a morphological category, but...)
  - $2^{(L, C_1, C_2, \ldots, C_n)}$ denotes the power set of $(L, C_1, C_2, \ldots, C_n)$
  - $A$, $L$ and $C_i$ are obviously language-dependent

The Dictionary (or Lexicon)
• Repository of information about words:
  - Morphological:
    • description of morphological "behavior": inflection patterns/classes
  - Syntactic:
    • Part of Speech
    • relations to other words:
      - subcategorization (or "surface valency frames")
  - Semantic:
    • semantic features
    • valency frames
  - ...and any other!
    (e.g., translation)

The Categories: Part of Speech: Open and Closed Categories
• Part of Speech - POS (pretty much stable set across languages)
  - not so much morphological (can be looked up in a dictionary), but:
  - morphological "behavior" is typically consistent within a POS category
  - Open categories: ("open" to additions)
    • verb, noun, pronoun, adjective, numeral, adverb
      - subject to inflection (in general); subject to cross-category derivations
      - newly coined words always belong to open POS categories
      - potentially unlimited number of words
  - Closed categories:
    • preposition, conjunction, article, interjection, clitic, particle
      - not a base for derivation (possibly only by compounding)
      - finite and (very) small number of words

The Categories: Part of Speech, Open Categories: Verbs
• Verbs:
  - infl. categories: person, number, tense, voice, aspect, [gender, neg.], ...
  - syntactic/semantic classification:
    • ordinary: (to) speak, (to) write
    • auxiliaries: be, have, will, would, do, go (going)
    • modals: can, could, may, should, must, want
    • phasal: begin, end, start
  - morphological classification:
    • conjugation type: regular/irregular (Ge.: weak/strong/irregular)
    • conjugation class: (Cz.: 5 classes + ~100 combinations)

The Categories: Part of Speech, Open Categories: Nouns
• Nouns:
  - infl. categories: number, [gender, case, negation, ...]
  - semantic classification:
    • human/animal/(non-living) things: driver/bird/stone
    • concrete/abstract: computer/thought
    • common/proper: table/Hopkins
  - syntactic classification: countable/uncountable: book, water
  - morphological classification:
    • pluralia/singularia tantum: data (is), police (are)
    • declension type ("pattern" or "class") (Cz.: 14 basic patterns, plus deviations: ~300 patterns, + irregular inflection)
    • "adverbial" nouns: afternoon, home, east (no inflection)

The Categories: Part of Speech, Open Categories: Pronouns
• Pronouns:
  - infl. categories: number, gender, case, negation; person
  - much like nouns (syntactic usage also similar)
  - (pro)noun ~ "stands for" a noun
  - classification (mostly syntactic/semantic):
    • personal: I, you, he, she, it, we, you, they
    • demonstrative: this, that
    • possessive: my, your, her, his, its, our, their; mine, yours, ours, ...
    • reflexive: myself, yourself, herself, ..., oneself
    • interrogative: what, which, who, whom, whose, that
    • indefinite ("nominal"): somebody, something, one
  - morphological classification: mostly idiosyncratic pattern

The Categories: Part of Speech, Open Categories: Adjectives
• Adjectives:
  - infl. categories: degree of comp., [number, gender, case, negation]
  - classification:
    • ordinary: new, interesting, [test (equipment)]
    • possessive: John's, driver's
    • proper: Appalachian (Mountains)
    • often derived from verbs/nouns: teaching (assistant), trendy, stylish
  - morphological classification:
    • mostly regular declension (Cz.: 4 basic patterns, ~10 total)
    • degrees of comparison (En.: big, bigger, biggest)
    • but: large number of forms (agreement, cf. section on syntax)

The Categories: Part of Speech, Open Categories: Adverbs
• Adverbs:
  - "infl."
    categories: degree of comp., [negation]
  - open cat.: regular derivation from adjectives common:
    • new -> newly, interesting -> interestingly
  - non-derived adverbs:
    • ordinary: so, well, just, too, then, often, there
    • wh-adverbs (interrogative): why, when, where, how
    • degree adverbs/qualifiers: very, too
  - morphological classification (not much, really...)
    • degree of comparison: well, better, best
      - soon, sooner (other lang.: all 3 degrees regular)

The Categories: Part of Speech, Open Categories: Numerals
• Numerals:
  - infl. categories: number, gender, case, negation
  - open cat.: compounding (Ge.: einundzwanzig, 21)
  - classification:
    • cardinals: one, five, hundred
      - NB: million etc. often considered noun
    • ordinals/fractionals: first, second, thirtieth
    • quantifiers: all, many, some, none
    • multiplicative: times, twice (Cz.: dvaadvacetkrát, 22-times)
    • multilateral: single, triple, twofold
  - morphological classification: as nouns/adjectives; many irreg.

The Categories: Part of Speech, Closed Categories
• Closed Categories: preposition, conjunction, article, interjection, clitic, particle
  - Morphological behavior: indeclinable
    • preposition: of, without, by, to
    • conjunction:
      - coordinating: and, but, or, however
      - subordinating: that, if, because, before, after, although, as
    • article: a, the
    • interjection: wow, eh, hello
    • clitic: 's; may be attached to whole phrases (at the end)
    • particle: yes, no, not; to (+verb)
      - many (otherwise) prepositions if part of phrasal verbs, e.g. (look) up

The Categories: Number and Gender
• Grammatical Number: Singular, Plural
  - nouns, pronouns, verbs, adjectives, numerals
    • computer / computers; (he) goes / (they) go
  - In some languages (Czech): Dual (nouns, pronouns, adjectives)
    • (Pl.) nohami / (Dl.)
      nohama (Cz.; (by) legs (of sth)/(by) legs (of sb))
• Grammatical Gender: Masculine, Feminine, Neuter
  - nouns, pronouns, verbs, adjectives, numerals
    • he/she/it; читал, читала, читало (Ru.; (he/she/it) was-reading)
    • nouns: (mostly) do not change gender for a single lexical unit
  - Also: animate/inanimate (gram., some genders), etc.
    • Mädchen (Ge.; girl, neuter); děti (Cz.; children, masc. inanim.)

The Categories: Case
• Case
  - English: only personal pronouns/possessives, 2 forms
  - other languages: 4 (German), 6 (Russian), 7 (Czech, Slovak, ...)
    • nouns, pronouns, adjectives, numerals
  - most common cases (forms in singular/plural):
      case            English               Czech (třída, 'class')
      nominative      I/we (work)           třída/třídy
      genitive        (picture of) me/us    třídy/tříd
      dative          (give to) me/us       třídě/třídám
      accusative      (see) me/us           třídu/třídy
      vocative        -/-                   třído/třídy
      locative        (about) me/us         třídě/třídách
      instrumental    (by) me/us            třídou/třídami

The Categories: Person, Tense
• Person
  - verbs, personal pronouns
    • 1st, 2nd, 3rd: (I) go, (you) go, (he) goes; (we) go, (you) go, (they) go
    • jdu, jdeš, jde, jdeme, jdete, jdou (Cz.)
• Tense (Cz. and Pol. forms of "go")
  - past: (you) went; Pol.: szliście
  - present: (you pl.) go; Cz.: jdete; Pol.: idziecie
  - future (if not "analytical"): Cz.: půjdete; Pol.: -
  - concurrent (gerund): going; Cz.: jda; Pol.: idąc
  - preceding: Pol.: szedłszy

Note on Tense
• Grammars: more (syntactic/semantic) tenses
  - but: morphology handles isolated words -> some tenses can be defined & handled only at an upper level (surface syntax)
• Examples of (traditional) tense (synthetical and analytical):
  - infinitive: (to) write (tenseless, personless, except negation (Cz.))
  - simple present/past: (I) write/(she) writes; (I, she) wrote
  - progressive present/past: (I) am writing; (I) was writing
  - perfect present/past: (I) have written; (I) had written
• all in passive voice (cf.
  later), too:
  - (the book) is being/has been/had been written etc.
• all in conditional mood, too (mood: in Eng. not a morph. category!)
  - (the book) would have been written

The Categories: Voice & Aspect
• Voice
  - active vs. passive
    • (I) drive / (I am being) driven
    • (Ich) setzte (mich) / (Ich bin) gesetzt (Ge.: to sit down)
• Aspect
  - imperfective vs. perfective:
    • покупал / купил (Ru.: I used to buy, I was buying / I (have) bought)
  - imperfective continuous vs. iterative (repeating)
    • spal / spával (Cz.: I was sleeping / I used to sleep (every...))

The Categories: Negation, Degree of Comparison
• Negation:
  - even in English: impossible (~ not possible)
  - Cz.: every verb, adjective, adverb, some nouns; prefix ne-
• Degree of Comparison (non-analytical):
  - adjectives, adverbs:
    • positive (big), comparative (bigger), superlative (biggest)
    • Pol.: (new) nowy, nowszy, najnowszy
• Combination (by prefixing):
  - Order? both possible: (neg.: Cz./Pol.: ne-/nie-, sup.: nej-/naj-)
    • Cz.: nejnemožnější (the most impossible)
    • Pol.: nienajwierniejszy (the most unfaithful)

Typology of Languages
• By morphological features
  - Analytical: using (function) words to express categories
    • English, also French, Italian, Japanese, Chinese
      - I would have been going ~ (Pol.) szłabym
  - Inflective: using prefix/suffix/infix, combines several categ.
    • Slavic: Czech, Russian, Polish, ... (not Bulgarian); also French, German; Arabic
      - (Cz. new (acc.)) novou (Adj., Fem., Sg., Acc., Non-neg., Pos.)
  - Agglutinative: one category per (non-lexical) morpheme
    • Finnish, Turkish, Hungarian
      - (Fin. plural): -i-

Categories & Tags
• Tagset:
  - list of all possible combinations of category values for a given language
  - $T \subset C_1 \times C_2 \times \ldots \times C_n$
  - typically string of letters & digits:
    • compact system: short idiosyncratic abbreviations:
      - NNS (gen. noun, plural)
    • positional system: each position i corresponds to $C_i$:
      - AAMP3----2A---- (gen. Adj., Masc., Pl., 3rd case (dative), comparative (2nd degree of comparison), Affirmative (no negation))
      - tense, person, variant, etc.: N/A (marked by "empty position", or '-')
• Famous tagsets: Brown, Penn, Multext[-East], ...
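
A positional tag such as the one on the Categories & Tags slide can be decoded mechanically, since each character position stands for one category. The sketch below is only an illustration under stated assumptions: the tag is read as 15 characters with explicit hyphens (AAMP3----2A----), and the position names in POSITIONS are hypothetical labels chosen to match the slide's gloss (the slide itself names only part of speech, gender, number, case, degree of comparison, negation, tense, person and variant).

```python
# Hypothetical names for the 15 positions of a positional tag.
# The ordering is an assumption made to match the slide's gloss of the
# example tag; it is not spelled out in the slides themselves.
POSITIONS = [
    "pos", "subpos", "gender", "number", "case",
    "possgender", "possnumber", "person", "tense",
    "grade", "negation", "voice", "reserve1", "reserve2", "variant",
]

EMPTY = "-"  # marks a category that does not apply ("empty position")


def decode_tag(tag):
    """Return {category name: value} for every non-empty position of the tag."""
    if len(tag) != len(POSITIONS):
        raise ValueError(
            "expected %d positions, got %d" % (len(POSITIONS), len(tag))
        )
    return {
        name: value
        for name, value in zip(POSITIONS, tag)
        if value != EMPTY
    }


# The slide's example: general adjective, masculine, plural, 3rd case
# (dative), 2nd degree of comparison, affirmative.
decoded = decode_tag("AAMP3----2A----")
for name, value in decoded.items():
    print(name, "=", value)
```

A compact tag such as NNS cannot be decoded this way, because its abbreviations are idiosyncratic; it would need a lookup table from whole tags to category values instead.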