NLP: The Main Issues

Why is NLP difficult?
• many "words", many "phenomena" --> many "rules"
  - OED: 400k words; Finnish lexicon (of forms): ~2·10^7
  - sentences, clauses, phrases, constituents, coordination, negation, imperatives/questions, inflections, parts of speech, pronunciation, topic/focus, and much more!
• irregularity (exceptions, exceptions to the exceptions, ...)
  - potato -> potatoes (tomato, hero, ...); photo -> photos; even both: mango -> mangos or -> mangoes
  - Adjective / Noun order: new book, electrical engineering, general regulations, flower garden, garden flower, but: Governor General

Difficulties in NLP (cont.)
• ambiguity
  - books: NOUN or VERB?
    ("you need many books" vs. "she books her flights online")
  - "No left turn weekdays 4-6 pm / except transit vehicles" (Charles Street at Cold Spring)
    (when may transit vehicles turn: always? never?)
  - "Thank you for not smoking, drinking, eating or playing radios without earphones." (MTA bus)
    (Thank you for not eating without earphones?? or even: Thank you for drinking without earphones!?)
  - "My neighbor's hat was taken by wind. He tried to catch it."
    (...catch the wind, or ...catch the hat?)

(Categorical) Rules or Statistics?
• Preferences:
  - clear cases: context clues: she books --> books is a verb
    - rule: if an ambiguous word (verb/non-verb) is preceded by a matching personal pronoun --> the word is a verb (see the rule sketch at the end of this section)
  - less clear cases: pronoun reference
    - she/he/it refers to the most recent noun or pronoun (?) (but maybe we can specify exceptions)
  - selectional:
    - catching hat >> catching wind (but why not?)
  - semantic:
    - never thank for drinking on a bus! (but what about the earphones?)

Solutions
• Don't guess if you know:
  - morphology (inflections)
  - lexicons (lists of words)
  - unambiguous names
  - perhaps some (really) fixed phrases
  - syntactic rules?
• Use statistics (based on real-world data) for preferences (only?); see the lexicon-plus-statistics sketch at the end of this section
• No doubt; but this is the big question!

Statistical NLP
• Imagine:
  - each sentence W = {w1, w2, ..., wn} gets a probability P(W|X) in a context X (think of it in the intuitive sense for now)
  - for every possible context X, sort all the imaginable sentences W according to P(W|X)
  - ideal situation:
    [Figure: sentences sorted by P(W|X); the top of the list is the best sentence (most probable in context X). NB: same for interpretation.]

Real World Situation
• Unable to specify the set of grammatical sentences today using fixed "categorical" rules (maybe never, cf. arguments in MS)
• Use a statistical "model" based on REAL WORLD DATA and care about the best sentence only (disregarding the "grammaticality" issue); see the argmax sketch at the end of this section
  [Figure: an arrow picking out the best sentence]
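The categorical rule from the "(Categorical) Rules or Statistics?" slide can be written down directly. Below is a minimal Python sketch assuming a whitespace tokenizer, a toy pronoun list, and a toy set of verb/noun-ambiguous words; the "matching" (agreement) part of the rule is deliberately skipped, and none of the word lists come from the lecture.

```python
# Minimal sketch of the categorical rule: an ambiguous verb/noun word
# immediately preceded by a personal pronoun is tagged VERB, else NOUN.
# Word lists are toy assumptions; agreement ("matching") is not checked.

PERSONAL_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}
VERB_NOUN_AMBIGUOUS = {"books", "flies", "walks"}  # assumed toy lexicon

def tag(sentence: str) -> list[tuple[str, str]]:
    """Tag each whitespace token as VERB, NOUN, or OTHER."""
    tokens = sentence.lower().split()
    tagged = []
    for i, token in enumerate(tokens):
        if token in VERB_NOUN_AMBIGUOUS:
            # The rule: personal pronoun immediately before -> verb.
            is_verb = i > 0 and tokens[i - 1] in PERSONAL_PRONOUNS
            tagged.append((token, "VERB" if is_verb else "NOUN"))
        else:
            tagged.append((token, "OTHER"))
    return tagged

print(tag("she books her flights online"))  # ('books', 'VERB')
print(tag("you need many books"))           # ('books', 'NOUN')
```

Even this tiny rule illustrates the slide's point: it fires correctly on both example sentences, but because the agreement check is omitted it would also tag "books" as VERB in "I books", and every such exception demands yet another rule.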
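The "Solutions" slide suggests a two-layer design: answer categorically where knowledge is certain (an unambiguous lexicon entry), and use statistical preferences only for what remains ambiguous. A minimal sketch follows; the lexicon entries and tag probabilities are invented for illustration, not taken from the lecture or any real corpus.

```python
# Minimal sketch of "don't guess if you know": an unambiguous lexicon
# entry is used as-is, and corpus-style tag probabilities (toy numbers
# here) decide only the genuinely ambiguous cases.

LEXICON = {"the": ["DET"], "wind": ["NOUN"], "books": ["NOUN", "VERB"]}
TAG_PROB = {("books", "NOUN"): 0.8, ("books", "VERB"): 0.2}  # invented

def tag_word(word: str) -> str:
    tags = LEXICON.get(word, ["NOUN"])  # unknown word: fall back to NOUN
    if len(tags) == 1:
        return tags[0]                  # known for sure: don't guess
    # ambiguous: pick the statistically preferred tag
    return max(tags, key=lambda t: TAG_PROB.get((word, t), 0.0))

print(tag_word("wind"))   # NOUN, decided by the lexicon alone
print(tag_word("books"))  # NOUN, decided by the toy statistics
```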
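Both the "Statistical NLP" and "Real World Situation" slides reduce to one operation: among the imaginable sentences W, pick the one maximizing P(W|X) for the given context X. A minimal sketch is below; the candidate list and probability table are invented toy values (the slides leave P(W|X) intuitive for now), and a real system would score its vast candidate space implicitly rather than enumerate it.

```python
# Minimal sketch of "care about the best sentence only": argmax of
# P(W|X) over candidate sentences W in a fixed context X. The candidate
# sentences and probabilities are invented toy values.

def best_sentence(candidates, prob, context):
    """Return the candidate W that maximizes P(W|X) for context X."""
    return max(candidates, key=lambda w: prob.get((w, context), 0.0))

# Toy P(W|X) for the hat example, where "catch the hat" is the
# selectionally more plausible reading.
prob = {
    ("he tried to catch the hat", "hat blown away"): 0.7,
    ("he tried to catch the wind", "hat blown away"): 0.1,
}
candidates = ["he tried to catch the hat", "he tried to catch the wind"]

print(best_sentence(candidates, prob, "hat blown away"))
# -> he tried to catch the hat
```

Note that grammaticality never enters: the argmax merely prefers one candidate over another, which is exactly the "disregard the grammaticality issue" stance of the last slide.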