Introduction to Natural Language Processing (600.465)
Markov Models
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

Review: Markov Process
• Bayes formula (chain rule):
  p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|w1,w2,...,wi-1)
• n-gram language models:
  - Markov process (chain) of the order n-1:
    p(W) = p(w1,w2,...,wT) = ∏i=1..T p(wi|wi-n+1,wi-n+2,...,wi-1)
• Using just one distribution (Ex.: trigram model: p(wi|wi-2,wi-1)):
  Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
  Words:     My car broke down , and within hours Bob 's car broke down ,  too .
  p(,|broke down) = p(w5|w3,w4) = p(w14|w12,w13)

Markov Properties
• Generalize to any process (not just words/LM):
  - Sequence of random variables: X = (X1,X2,...,XT)
  - Sample space S (states), size N: S = {s0,s1,s2,...,sN}
1. Limited History (Context, Horizon):
   ∀i ∈ 1..T: P(Xi|X1,...,Xi-1) = P(Xi|Xi-1)
   (example: in the sequence 1 7 3 7 9 0 6 7 3 4 5 ..., only the immediately preceding symbol matters)
2. Time invariance (M.C. is stationary/homogeneous):
   ∀i ∈ 1..T, ∀y,x ∈ S: P(Xi=y|Xi-1=x) = p(y|x)
   (...the same distribution applies at every position)

Long History Possible
• What if we want trigrams?
• Formally, use a transformation: define new variables Qi such that Xi = (Qi-1,Qi); then
  P(Xi|Xi-1) = P(Qi-1,Qi|Qi-2,Qi-1) = p(Qi|Qi-2,Qi-1)
  i.e., predicting Xi from the history Xi-1 = (Qi-2,Qi-1).

Graph Representation: State Diagram
• S = {s0,s1,s2,...,sN}: states
• Distribution P(Xi|Xi-1):
  - transitions (as arcs) with probabilities attached to them

The Trigram Case
• S = {s0,s1,s2,...,sN}: states: pairs si = (x,y)
• Distribution P(Xi|Xi-1): (r.v. X: generates pairs si)
  p(toe) = .6 × .88 × .07 ≅ .037
  p(one) = ?

Finite State Automaton
• States ~ symbols of the [input/output] alphabet
  - pairs (or more): last element of the n-tuple
• Arcs ~ transitions (sequence of states)
• [Classical FSA: alphabet symbols on arcs:
  - transformation: arcs ↔ nodes]
• Possible thanks to the "limited history" Markov Property
• So far: Visible Markov Models (VMM)

Hidden Markov Models
• The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":

Added Flexibility
• So far, no change; but different states may generate the same output (why not?):

Output from Arcs
• Added flexibility: generate output from arcs, not states:

...and Finally, Add Output Probabilities
• Maximum flexibility: [unigram] distribution (sample space: output alphabet) at each output arc (!simplified!), e.g.:
  p(t) = 0,  p(o) = 0,  p(e) = 1
  p(t) = .8, p(o) = .1, p(e) = .1
  p(t) = .5, p(o) = .2, p(e) = .3
• Alternatively, mark the arcs by output symbols and get rid of the output distributions:
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237
• In the future, we will use the view more convenient for the problem at hand.

Formalization
• HMM (the most general case):
  - five-tuple (S, s0, Y, PS, PY), where:
    * S = {s0,s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of prob. distributions of transitions,
      - size of PS: |S|²
    * PY(yk|si,sj) is the set of output (emission) probability distributions,
      - size of PY: |S|² × |Y|
• Example:
  - S = {x, 1, 2, 3, 4}, s0 = x
  - Y = {t, o, e}
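To make the five-tuple concrete, here is a minimal Python sketch of one way such an HMM could be stored. The state set, start state, alphabet and transition probabilities follow the example (and the transition table on the next foil); the per-arc emission values, and the names HMM and check, are illustrative assumptions of mine, not part of the lecture.

```python
# A minimal sketch of the HMM five-tuple (S, s0, Y, PS, PY).
# Transitions are indexed by (from_state, to_state); emissions by
# (from_state, to_state, symbol), i.e. output is generated on the arcs.
from dataclasses import dataclass

@dataclass
class HMM:
    states: set      # S
    start: str       # s0
    alphabet: set    # Y
    trans: dict      # PS[(si, sj)] -> prob, up to |S|^2 entries
    emit: dict       # PY[(si, sj, yk)] -> prob, up to |S|^2 * |Y| entries

    def check(self, tol=1e-9):
        # Every state's outgoing transitions and every used arc's emission
        # distribution should sum to 1 (zero-probability arcs may be omitted).
        for si in self.states:
            out = sum(p for (a, _), p in self.trans.items() if a == si)
            assert out == 0 or abs(out - 1.0) < tol
        for (si, sj), p in self.trans.items():
            if p > 0:
                e = sum(self.emit.get((si, sj, y), 0.0) for y in self.alphabet)
                assert abs(e - 1.0) < tol

# Transition probabilities as in the example; the emission values below are
# made up purely for illustration (the lecture's exact per-arc values differ).
hmm = HMM(
    states={"x", "1", "2", "3", "4"}, start="x", alphabet={"t", "o", "e"},
    trans={("x", "1"): .6, ("x", "3"): .4, ("1", "2"): .12, ("1", "4"): .88,
           ("2", "4"): 1.0, ("3", "1"): 1.0, ("4", "2"): 1.0},
    emit={("x", "1", "t"): .8, ("x", "1", "o"): .1, ("x", "1", "e"): .1,
          ("x", "3", "t"): 1.0, ("1", "2", "o"): 1.0, ("1", "4", "o"): 1.0,
          ("2", "4", "e"): 1.0, ("3", "1", "t"): 1.0, ("4", "2", "e"): 1.0},
)
hmm.check()
```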
Formalization - Example
• Example (for the graph, see the foils above):
  - S = {x, 1, 2, 3, 4}, s0 = x
  - Y = {e, o, t}
  - PS(sj|si):
         x    1    2    3    4
    x    0   .6    0   .4    0
    1    0    0  .12    0  .88
    2    0    0    0    0    1
    3    0    1    0    0    0
    4    0    0    1    0    0
  - PY(yk|si,sj): one distribution over {e, o, t} for each arc (as on the graph), each summing to 1 (Σ = 1).

Using the HMM
• The generation algorithm (of limited value :-)):
  1. Start in s = s0.
  2. Move from s to s' with probability PS(s'|s).
  3. Output (emit) symbol yk with probability PY(yk|s,s').
  4. Repeat from step 2 (until somebody says enough).
• More interesting usage:
  - Given an output sequence Y = {y1,y2,...,yk}, compute its probability.
  - Given an output sequence Y = {y1,y2,...,yk}, compute the most likely sequence of states which has generated it.
  - ...plus variations: e.g., n best state sequences

Introduction to Natural Language Processing (600.465)
HMM Algorithms: Trellis and Viterbi
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

HMM: The Two Tasks
• HMM (the general case):
  - five-tuple (S, s0, Y, PS, PY), where:
    * S = {s1,s2,...,sT} is the set of states, s0 is the initial state,
    * Y = {y1,y2,...,yV} is the output alphabet,
    * PS(sj|si) is the set of prob. distributions of transitions,
    * PY(yk|si,sj) is the set of output (emission) probability distributions.
• Given an HMM & an output sequence Y = {y1,y2,...,yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y.

Trellis - Deterministic Output
• HMM with deterministic output; Trellis: time/position t = 0, 1, 2, 3; Y: t, o, e
• trellis state: (HMM state, position)
• each trellis state holds one number (prob): α
  e.g. α(x,0) = 1, α(A,1) = .6, ...
• probability of Y: α in the last state

Creating the Trellis: The Start
• Start in the start state (x),
  - set its α(x,0) to 1.
• Create the first stage:
  - get the first "output" symbol y1,
  - create the first stage (column),
  - but only those trellis states which generate y1,
  - set their α(state,1) to PS(state|x) · α(x,0).
• ...and forget about the 0-th stage.

Trellis: The Next Step
• Suppose we are in stage i.
• Creating the next stage:
  - create all trellis states in the next stage which generate yi+1, but only those reachable from any of the stage-i states,
  - set their α(state,i+1) to:
    PS(state|prev.state) × α(prev.state,i)
    (add up all such numbers on arcs going to a common trellis state),
  - ...and forget about stage i.

Trellis: The Last Step
• Continue until the "output" is exhausted:
  - |Y| = 3: until stage 3.
• Add together all the α(state,|Y|).
• That's the P(Y).
  (in the example: P(Y) = .568)
• Observation (pleasant):
  - memory usage max: 2|S|
  - multiplications max: |S|²|Y|
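The stage-by-stage α computation just described fits in a few lines of Python. The sketch below is illustrative rather than the lecture's code: the function name forward and the toy numbers at the bottom are my assumptions. It keeps only the current stage in memory (the 2|S| observation) and multiplies the transition and emission probability on each arc, which also covers the general case on the following foils where arcs generate the output.

```python
# A small sketch of the trellis (forward) pass for Task 1: computing P(Y).
# trans[(si, sj)] = PS(sj|si); emit[(si, sj, y)] = PY(y|si, sj).
# Recurrence: alpha(state, i+1) = sum over incoming arcs marked by y_{i+1} of
#             PS(state|prev) * PY(y_{i+1}|prev, state) * alpha(prev, i)
from collections import defaultdict

def forward(trans, emit, start, output):
    alpha = {start: 1.0}                      # stage 0: only the start state
    for y in output:                          # one new trellis stage per symbol
        nxt = defaultdict(float)
        for (si, sj), p_t in trans.items():
            p_e = emit.get((si, sj, y), 0.0)  # arcs not marked by y contribute 0
            if si in alpha and p_t * p_e > 0:
                nxt[sj] += alpha[si] * p_t * p_e  # add up arcs entering the same state
        alpha = dict(nxt)                     # ...and forget about stage i (memory 2|S|)
    return sum(alpha.values())                # P(Y): add up the alphas of the last stage

# Illustrative toy model (the numbers are made up, not the lecture's):
trans = {("x", "A"): 0.6, ("x", "B"): 0.4, ("A", "A"): 0.5,
         ("A", "B"): 0.5, ("B", "A"): 1.0}
emit = {("x", "A", "t"): 1.0, ("x", "B", "t"): 1.0,
        ("A", "A", "o"): 0.6, ("A", "A", "e"): 0.4,
        ("A", "B", "o"): 1.0, ("B", "A", "e"): 1.0}
print(forward(trans, emit, "x", "toe"))       # probability of the output "toe"
```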
Trellis: The General Case (still bigrams)
• Start as usual:
  - start state (x), set its α(x,0) to 1.
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237

General Trellis: The Next Step
• We are in stage i:
  - Generate the next stage i+1 as before (except that now arcs generate output, thus use only those arcs marked by the output symbol yi+1).
  - For each generated state, compute
    α(state,i+1) = Σ(incoming arcs) PY(yi+1|state,prev.state) × α(prev.state,i)
  - ...and forget about stage i as usual.

Trellis: The Complete Example
  (full trellis computed stage by stage for the example above)

The Case of Trigrams
• Like before, but:
  - states correspond to bigrams,
  - the output function always emits the second output symbol of the pair (state) to which the arc goes.
• Multiple paths are not possible → trellis not really needed.

Trigrams with Classes
• More interesting:
  - n-gram class LM: p(wi|wi-2,wi-1) = p(wi|ci) p(ci|ci-2,ci-1)
  - → states are pairs of classes (ci-1,ci), and emit "words" (letters in our example):
    p(t|C) = 1 (usual, non-overlapping classes)
    p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
  p(toe) = .6 × 1 × .88 × .3 × .07 × .6 ≅ .00665
  p(teo) = .6 × 1 × .88 × .6 × .07 × .3 ≅ .00665
  p(toy) = .6 × 1 × .88 × .3 × .07 × .1 ≅ .00111
  p(tty) = .6 × 1 × .12 × 1 × 1 × .1 ≅ .0072

Class Trigrams: the Trellis
• Trellis generation (Y = "toy"):
  p(t|C) = 1, p(o|V) = .3, p(e|V) = .6, p(y|V) = .1
  α = .6, α = .6 × .88 × .3 = .1584, α = .1584 × .07 × .1 = .00111
• Again, the trellis is useful but not really needed.

Overlapping Classes
• Imagine that classes may overlap
  - e.g. 'r' is sometimes a vowel, sometimes a consonant; it belongs to V as well as C:
    p(t|C) = .3, p(r|C) = .7
    p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
  p(try) = ?

Overlapping Classes: Trellis Example
  (trellis for p(try) with the overlapping classes above)

Trellis: Remarks
• So far, we went left to right (computing α).
• Same result: going right to left (computing β)
  - supposing we know where to start (finite data).
• In fact, we might start in the middle, going left and right.
• Important for parameter estimation (Forward-Backward Algorithm, alias Baum-Welch).
• Implementation issues:
  - scaling/normalizing probabilities, to avoid too-small numbers & addition problems with many transitions.

The Viterbi Algorithm
• Solving the task of finding the most likely sequence of states which generated the observed data,
• i.e., finding
  Sbest = argmaxS P(S|Y),
  which is equal to (Y is constant and thus P(Y) is fixed):
  Sbest = argmaxS P(S,Y) =
        = argmaxS P(s0,s1,s2,...,sk,y1,y2,...,yk) =
        = argmaxS ∏i=1..k p(yi|si,si-1) p(si|si-1)

The Crucial Observation
• Imagine the trellis built as before (but do not compute the αs yet; assume they are o.k.); stage i:
  α = max(.3, .32) = .32
• NB: remember the previous state from which we got the maximum ("reverse" the arc).
• This is certainly the "backwards" maximum to (D,2)... but it cannot change even when we go forward (Markov Property: Limited History).
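The crucial observation turns the trellis into the Viterbi algorithm: each trellis state keeps only the maximum incoming score plus a back pointer to the state it came from. The Python sketch below is illustrative: it uses single classes C and V as states (a class bigram simplification rather than the class pairs of the trigram case), the emission values of the overlapping-class example, and made-up class transition probabilities, so its winning sequence need not match the one computed on the next foil; the name viterbi and the toy numbers are assumptions.

```python
# A minimal Viterbi sketch for Task 2: the most likely state sequence.
# Same parameter layout as the forward sketch; the only changes are
# max instead of sum, and a back pointer remembering where the max came from.
def viterbi(trans, emit, start, output):
    delta = {start: 1.0}                 # best score of any path ending in a state
    back = []                            # back[i][state] = best previous state
    for y in output:
        nxt, bp = {}, {}
        for (si, sj), p_t in trans.items():
            p_e = emit.get((si, sj, y), 0.0)
            if si in delta and p_t * p_e > 0:
                score = delta[si] * p_t * p_e
                if score > nxt.get(sj, 0.0):       # keep only the maximum ...
                    nxt[sj], bp[sj] = score, si    # ... and remember its origin
        delta, back = nxt, back + [bp]
    # Track back from the best final state; by the limited-history property
    # the stored back pointers cannot change later.
    state = max(delta, key=delta.get)
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path)), max(delta.values())

# Illustrative toy run: emission values from the overlapping-class example,
# transition values invented for the sketch.
trans = {("x", "C"): 0.5, ("x", "V"): 0.5, ("C", "V"): 1.0,
         ("V", "C"): 0.6, ("V", "V"): 0.4}
emit = {("x", "C", "t"): 0.3, ("x", "C", "r"): 0.7,
        ("x", "V", "o"): 0.1, ("x", "V", "e"): 0.3, ("x", "V", "y"): 0.4, ("x", "V", "r"): 0.2,
        ("C", "V", "o"): 0.1, ("C", "V", "e"): 0.3, ("C", "V", "y"): 0.4, ("C", "V", "r"): 0.2,
        ("V", "C", "t"): 0.3, ("V", "C", "r"): 0.7,
        ("V", "V", "o"): 0.1, ("V", "V", "e"): 0.3, ("V", "V", "y"): 0.4, ("V", "V", "r"): 0.2}
print(viterbi(trans, emit, "x", "rry"))   # most likely class sequence for "rry"
```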
Viterbi Example
• 'r' classification (C or V?, sequence?):
  p(t|C) = .3, p(r|C) = .7
  p(o|V) = .1, p(e|V) = .3, p(y|V) = .4, p(r|V) = .2
  argmaxXYZ p(rry|XYZ) = ?
• Possible state sequences:
  (x,V)(V,C)(C,V) [VCV]
  (x,C)(C,C)(C,V) [CCV]
  (x,C)(C,V)(V,V) [CVV]

Viterbi Computation
  (trellis for the example above: at every stage, keep the maximum α and a back pointer to the state it came from)

n-best State Sequences
• Keep track of the n best "back pointers".
• Ex.: n = 2: two "winners":
  VCV (best)
  CCV (2nd best)

Tracking Back the n-best paths
• Backtracking-style algorithm:
  - Start at the end, in the best of the n states (sbest).
  - Put the other n-1 best node/back pointer pairs on a stack, except those leading from sbest to the same best-back state.
  - Follow the back "beam" towards the start of the data, spitting out nodes on the way (backwards, of course), using always only the best back pointer.
  - At every beam split, push the diverging node/back pointer pairs onto the stack (node/beam width is sufficient!).
  - When you reach the start of the data, close the path, and pop the topmost node/back pointer (width) pair from the stack.
  - Repeat until the stack is empty; expand the result tree if necessary.

Pruning
• Sometimes, there are too many trellis states in a stage.
• Criteria:
  (a) α < threshold
  (b) α/Σα < threshold
  (c) # of states > threshold (get rid of the smallest α)
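A pruning step of this kind is easy to sketch. In the Python sketch below, the function name prune_stage and the threshold values are my assumptions, criterion (b) is read as a state's share of the stage's total α (one plausible interpretation of the slide), and a stage is assumed to be stored as a dict mapping trellis states to their α values.

```python
# A small sketch of trellis-stage pruning using the three criteria above:
# (a) drop states with alpha below an absolute threshold,
# (b) drop states whose share of the stage's total alpha is below a relative threshold,
# (c) if there are still too many states, keep only the largest alphas.
def prune_stage(alpha, abs_thr=1e-6, rel_thr=0.01, max_states=50):
    total = sum(alpha.values())
    kept = {s: a for s, a in alpha.items()
            if a >= abs_thr and (total == 0 or a / total >= rel_thr)}
    if len(kept) > max_states:                         # criterion (c): cap the beam
        best = sorted(kept, key=kept.get, reverse=True)[:max_states]
        kept = {s: kept[s] for s in best}
    return kept

# Illustrative stage (alpha values made up for the example):
stage = {"(C,V)": 0.0066, "(V,V)": 0.00435, "(V,C)": 0.002, "(C,C)": 0.0000009}
print(prune_stage(stage, abs_thr=1e-6, rel_thr=0.05, max_states=3))
```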