Introduction to Natural Language Processing (600.465)
HMM Parameter Estimation: the Baum-Welch Algorithm

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

HMM: The Tasks
• HMM (the general case):
  - a five-tuple (S, S0, Y, PS, PY), where:
    * S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
    * Y = {y1, y2, ..., yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
    * PY(yk|si,sj) is the set of output (emission) probability distributions.
• Given an HMM & an output sequence Y = {y1, y2, ..., yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y;
  - (Task 3) estimate the parameters (transition/output distributions).

A Variant of EM
• Idea (~ EM; for another variant see LM smoothing):
  - Start with (possibly random) estimates of PS and PY.
  - Compute (fractional) "counts" of the state transitions/emissions taken, from PS and PY, given the data Y.
  - Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
• Remarks:
  - many more parameters than the simple four-way smoothing
  - no proofs here; see Jelinek, Chapter 9

Setting
• HMM (without PS, PY): (S, S0, Y), and data T = {yi ∈ Y}i=1..|T|
  - will use T ~ |T|
• HMM structure is given: (S, S0)
• PS: typically, one wants to allow a "fully connected" graph
  - (i.e. no transitions forbidden ~ no transitions set to hard 0)
  - why? → better to leave it to the learning phase, based on the data!
  - sometimes possible to remove some transitions ahead of time
• PY: should be restricted (if not, we will not get anywhere!)
  - restricted ~ hard 0 probabilities of p(y|s,s')
  - "Dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization
• For computing the initial expected "counts"
• Important part
  - EM is guaranteed to find a local maximum only (albeit a good one in most cases)
• PY initialization more important
  - fortunately, often easy to determine
  - together with the dictionary ↔ vocabulary mapping, get counts, then MLE
• PS initialization less important
  - e.g. uniform distribution for each p(.|s)

Data Structures
• Will need storage for:
  - the predetermined structure of the HMM (unless fully connected → need not keep it!)
  - the parameters to be estimated (PS, PY)
  - the expected counts (same size as PS, PY)
  - the training data T = {yi ∈ Y}i=1..|T|
  - the trellis (if fully connected):
    • size: |T| × |S|
    • each trellis state holds two [float] numbers (forward/backward)
  [Figure: trellis - a grid of |S| states over |T| positions]

The Algorithm Part I
1. Initialize PS, PY.
2. Compute "forward" probabilities:
  • follow the procedure for the trellis (summing), compute α(s,i) everywhere
  • use the current values of PS, PY (p(s'|s), p(y|s,s')):
    α(s',i) = Σ_{s→s'} α(s,i-1) × p(s'|s) × p(y_i|s,s')
  • NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
  • start at all nodes of the last stage, proceed backwards, β(s,i)
  • i.e., the probability of the "tail" of the data from stage i to the end:
    β(s',i) = Σ_{s'←s} β(s,i+1) × p(s|s') × p(y_{i+1}|s',s)
  • also, keep the β(s,i) at all trellis states
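The two passes in Steps 2 and 3 can be sketched as below for the arc-emission HMM used here. This is a minimal illustration under assumptions, not code from the slides: the parameter layout (trans[s][s2] = p(s2|s), emit[(s, s2)][y] = p(y|s,s2)), the function names, and the start-state label "X" (the never-occurring front buffer state of the worked example below) are all introduced here for the sketch.

```python
def forward(states, trans, emit, data, start_state="X"):
    """alpha[i][s] = probability of the output prefix y_1..y_i, ending in state s."""
    alpha = [{s: (1.0 if s == start_state else 0.0) for s in states}]  # alpha(X,0) = 1
    for i, y in enumerate(data, start=1):
        stage = {}
        for s2 in states:                       # alpha(s',i): sum over predecessors s
            stage[s2] = sum(alpha[i - 1][s]
                            * trans.get(s, {}).get(s2, 0.0)        # p(s2|s)
                            * emit.get((s, s2), {}).get(y, 0.0)    # p(y_i|s,s2)
                            for s in states)
        alpha.append(stage)                     # keep every stage (per the NB above)
    return alpha

def backward(states, trans, emit, data, start_state="X"):
    """beta[i][s] = probability of the tail y_{i+1}..y_T given state s at stage i."""
    T = len(data)
    last = {s: (0.0 if s == start_state else 1.0) for s in states}     # beta(s,T) = 1, s != X
    beta = [{s: 0.0 for s in states} for _ in range(T)] + [last]
    for i in range(T - 1, -1, -1):
        y = data[i]                             # y_{i+1} in the slides' 1-based numbering
        for s in states:                        # beta(s,i): sum over successors s2
            beta[i][s] = sum(trans.get(s, {}).get(s2, 0.0)         # p(s2|s)
                             * emit.get((s, s2), {}).get(y, 0.0)   # p(y_{i+1}|s,s2)
                             * beta[i + 1][s2]
                             for s2 in states)
    return beta
```

Keeping every α and β stage is what makes the count collection in Part II below a single pass over the data.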
The Algorithm Part II
4. Collect counts:
  • for each output/transition pair compute
    c(y,s,s') = Σ_{i=0..k-1, y=y_{i+1}} α(s,i) × p(s'|s) × p(y_{i+1}|s,s') × β(s',i+1)
    (α(s,i) ~ prefix prob., p(s'|s) ~ transition prob., p(y_{i+1}|s,s') ~ output prob., β(s',i+1) ~ tail prob.)
    - one pass through the data, only stopping at (outputs of) y
  • c(s,s') = Σ_{y∈Y} c(y,s,s') (assuming all observed y_i are in Y)
  • c(s) = Σ_{s'∈S} c(s,s')
5. Reestimate:
  • p'(s'|s) = c(s,s') / c(s)
  • p'(y|s,s') = c(y,s,s') / c(s,s')
6. Repeat 2-5 until the desired convergence limit is reached.

Baum-Welch: Tips & Tricks
• Normalization badly needed
  - long training data → extremely small probabilities
• Normalize α, β using the same normalization factor N(i) = Σ_{s∈S} α(s,i), as follows:
  • compute α(s,i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go
  • at the end of each stage, recompute all α's (for each state s): α*(s,i) = α(s,i) / N(i)
  • use the same N(i) for the β's at the end of each backward (Step 3) stage: β*(s,i) = β(s,i) / N(i)

Example
• Task: pronunciation of "the"
• Solution: build an HMM, fully connected, 4 states:
  * S - short article, L - long article, C, V - word starting with a consonant, vowel
  * thus, only "the" is ambiguous (a, an, the - not members of C, V)
• Output from states only (p(w|s,s') = p(w|s'))
• Data Y: an egg and a piece of the big .... the end
  [Figure: trellis over the data]

Example: Initialization
• Output probabilities:
  - p_init(w|c) = c(c,w) / c(c); where c(S,the) = c(L,the) = c(the)/2
    (other than that, everything is deterministic)
• Transition probabilities:
  - p_init(c'|c) = 1/4 (uniform)
• Don't forget:
  - about the space needed
  - initialize α(X,0) = 1 (X: the never-occurring front buffer state)
  - initialize β(s,T) = 1 for all s (except for s = X)

Fill in alpha, beta
• Left to right, alpha:
  α(s',i) = Σ_{s→s'} α(s,i-1) × p(s'|s) × p(w_i|s')   (output from states)
• Remember the normalization (N(i)).
• Similarly, beta (on the way back from the end).
  [Figure: trellis for "an egg and a piece of the big ... the end"]

Introduction to Natural Language Processing (600.465)
HMM Tagging
Dr. Jan Hajic, CS Dept., Johns Hopkins Univ.

The Task
• tagging ~ morphological disambiguation
• tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
• mapping w → {t ∈ VT} exists
  - restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet, L is the set of lemmas
  - extension to punctuation, sentence boundaries (treated as words)

The Setting
• Noisy Channel setting:
  Input (tags): NNP VBZ DT ...  →  the channel (adds "noise")  →  Output (words): John drinks the ...
• Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
  - p(T|W) = p(W|T) p(T) / p(W)
  - p(W) fixed (W given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model
• Two models (d = |W| = |T|, the word sequence length):
  - p(W|T) = Π_{i=1..d} p(w_i | w_1,...,w_{i-1}, t_1,...,t_d)
  - p(T) = Π_{i=1..d} p(t_i | t_1,...,t_{i-1})
• Too many parameters (as always)
• Approximation using the following assumptions:
  - words do not depend on the context
  - the tag depends on a limited history: p(t_i|t_1,...,t_{i-1}) ≅ p(t_i|t_{i-n+1},...,t_{i-1})
    → the n-gram tag "language" model
  - the word depends on its tag only: p(w_i|w_1,...,w_{i-1},t_1,...,t_d) ≅ p(w_i|t_i)
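As a concrete reading of the two independence assumptions, the sketch below scores one tagged sentence under a trigram tag model (n = 3) and the word-given-tag approximation. The table names, the "<s>" padding symbol, and the use of log probabilities are illustrative assumptions, not part of the slides.

```python
import math

def tagged_logprob(words, tags, p_word_given_tag, p_tag_given_hist):
    """log[ p(W|T) p(T) ] under the approximations above, assuming:
    p_word_given_tag[(w, t)]      ~ p(w_i | t_i)
    p_tag_given_hist[(t2, t1, t)] ~ p(t_i | t_{i-2}, t_{i-1})  (trigram tag model)"""
    assert len(words) == len(tags)              # d = |W| = |T|
    hist = ("<s>", "<s>")                       # assumed start-of-sentence padding
    logp = 0.0
    for w, t in zip(words, tags):
        p_t = p_tag_given_hist.get(hist + (t,), 0.0)   # tag given limited history
        p_w = p_word_given_tag.get((w, t), 0.0)        # word given its tag only
        if p_t == 0.0 or p_w == 0.0:
            return float("-inf")                # unseen event (before smoothing)
        logp += math.log(p_t) + math.log(p_w)
        hist = (hist[1], t)                     # slide the (n-1)-tag history window
    return logp
```

Maximizing this score over all tag sequences T is the argmax_T p(W|T) p(T) search that the Viterbi decoder performs later ("Running the Tagger").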
The HMM Model Definition
• (Almost) the general HMM:
  - output (words) emitted by states (not arcs)
  - states: (n-1)-tuples of tags if an n-gram tag model is used
  - five-tuple (S, s0, Y, PS, PY), where:
    * S = {s0, s1, s2, ..., sT} is the set of states, s0 is the initial state,
    * Y = {y1, y2, ..., yV} is the output alphabet (the words),
    * PS(sj|si) is the set of probability distributions of transitions:
      PS(sj|si) = p(t_i|t_{i-n+1},...,t_{i-1}); s_j = (t_{i-n+2},...,t_i), s_i = (t_{i-n+1},...,t_{i-1}),
    * PY(yk|si) is the set of output (emission) probability distributions
  - another simplification: PY(yk|si) = PY(yk|sj) if s_i and s_j contain the same tag as the rightmost element:
    PY(yk|si) = p(w_i|t_i)

Supervised Learning (Manually Annotated Data Available)
• Use MLE
  - p(w_i|t_i) = c_wt(t_i,w_i) / c_t(t_i)
  - p(t_i|t_{i-n+1},...,t_{i-1}) = c_tn(t_{i-n+1},...,t_{i-1},t_i) / c_t(n-1)(t_{i-n+1},...,t_{i-1})
• Smooth (both!)
  - p(w_i|t_i): "Add 1" for all possible tag,word pairs using a predefined dictionary (thus some 0s are kept!)
  - p(t_i|t_{i-n+1},...,t_{i-1}): linear interpolation
    • e.g. for the trigram model:
      p'_λ(t_i|t_{i-2},t_{i-1}) = λ3 p(t_i|t_{i-2},t_{i-1}) + λ2 p(t_i|t_{i-1}) + λ1 p(t_i) + λ0 / |VT|

Unsupervised Learning
• Completely unsupervised learning is impossible
  - at least if we have the tagset given: how would we associate words with tags?
• Assumed (minimal) setting:
  - tagset known
  - dictionary/morphological analysis available (providing the possible tags for any word)
• Use: the Baum-Welch algorithm (see class 15, 10/13)
  - "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Comments on Unsupervised Learning
• Initialization of Baum-Welch
  - if some annotated data are available, use them
  - keep 0 for impossible output probabilities
• Beware of:
  - degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
  - use heldout data for cross-checking
• Supervised is almost always better

Unknown Words
• "OOV" words (out-of-vocabulary)
  - we do not have a list of possible tags for them
  - and we certainly have no output probabilities
• Solutions:
  - try all tags (uniform distribution)
  - try open-class tags (uniform, unigram distribution)
  - try to "guess" the possible tags (based on the suffix/ending)
    - use different output distributions based on the ending (and/or other factors, such as capitalization)

Running the Tagger
• Use Viterbi
  - remember to handle unknown words
  - single-best, n-best possible
• Another option:
  - always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers nor a path-backpass)
  - introduces random errors, implausible sequences, but might get higher accuracy (fewer secondary errors)

(Tagger) Evaluation
• A must. Test data (S), previously unseen (in training)
  - change the test data often if at all possible! ("feedback cheating")
  - error-rate based
• Formally:
  - Out(w) = set of output "items" for an input "item" w
  - True(w) = the single correct output (annotation) for w
  - Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
  - Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
  - Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Evaluation Metrics
• Accuracy: single output (tagging: each word gets a single tag)
  - Error rate: Err(S) = Errors(S) / |S|
  - Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)
• What if there are multiple (or no) outputs?
  - Recall: R(S) = Correct(S) / |S|
  - Precision: P(S) = Correct(S) / Generated(S)
  - Combination: the F measure: F = 1 / (α/P + (1-α)/R)
    • α is the weight given to precision vs. recall; for α = 0.5, F = 2PR / (R+P)
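A small sketch of the metrics just defined, assuming `out` maps each test item w_i to the set of output tags generated for it and `true` maps it to the single correct tag; the names and the dictionary layout are illustrative, not from the slides.

```python
def evaluate(out, true, alpha=0.5):
    """Precision, recall and F measure over a test set S (the keys of `true`)."""
    n = len(true)                                             # |S|
    correct = sum(1 for w, t in true.items() if t in out[w])  # Correct(S)
    generated = sum(len(ts) for ts in out.values())           # Generated(S)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / generated                           # P(S)
    recall = correct / n                                      # R(S)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)      # F = 1/(a/P + (1-a)/R)
    return precision, recall, f
```

For the single-output case (exactly one tag per word), Correct(S) + Errors(S) = |S|, so accuracy coincides with recall.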
Introduction to Natural Language Processing (600.465)
Transformation-Based Tagging

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

The Task, Again
• Recall:
  - tagging ~ morphological disambiguation
  - tagset VT ⊂ (C1, C2, ..., Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  - mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet, L is the set of lemmas
    • extension to punctuation, sentence boundaries (treated as words)

Setting
• Not a source-channel view
• Not even a probabilistic model (no "numbers" used when tagging a text after a model is developed)
• Statistical, yes:
  - uses training data (a combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training)
  - learning [rules]
  - criterion: accuracy (that's what we are interested in in the end, after all!)

The General Scheme
  [Figure: Training - the annotated training data feed the Learner; a copy with tags removed ("without annotation") gets initial tags assigned. Tagging - new data get the same initial tag assignment, then the learned rules are applied.]

The I/O of an Iteration
• In (iteration i):
  - intermediate data (initial, or the result of the previous iteration)
  - the TRUTH (the annotated training data)
  - [the pool of possible rules]
• Out:
  - one rule r_selected(i) to enhance the set of rules learned so far
  - intermediate data (the input data transformed by the rule learned in this iteration, r_selected(i))

The Initial Assignment of Tags
• One possibility:
  - NN
• Another:
  - the most frequent tag for a given word form
• Even:
  - use an HMM tagger for the initial assignment
• Not particularly sensitive

The Criterion
• Error rate (or Accuracy):
  - beginning of an iteration: some error rate E_in
  - each possible rule r, when applied at every data position:
    * makes an improvement somewhere in the data (c_improved(r))
    * makes it worse at some places (c_worsened(r))
    * and, of course, does not touch the remaining data
• Rule contribution to the improvement of the error rate:
  - contrib(r) = c_improved(r) - c_worsened(r)
• Rule selection at iteration i:
  - r_selected(i) = argmax_r contrib(r)
• New error rate: E_out = E_in - contrib(r_selected(i))

The Stopping Criterion
• Obvious:
  - no improvement can be made: contrib(r) ≤ 0
  - or the improvement is too small: contrib(r) ≤ Threshold
• NB: prone to overtraining!
  - therefore, setting a reasonable threshold is advisable
• Heldout?
  - maybe: remove rules which degrade performance on H

The Pool of Rules (Templates)
• Format: change the tag at position i from a to b / condition
• Context rules (condition definition - "template"): a window around position i,
    words: w_{i-3}  w_{i-2}  w_{i-1}  [w_i]  w_{i+1}  w_{i+2}  w_{i+3}
    tags:           t_{i-2}  t_{i-1}  [t_i]  t_{i+1}  t_{i+2}
• Instantiation: concrete w, t values permitted

Lexical Rules
• Other type: lexical rules - "look inside the word"
  - conditions on the form of w_i itself (prefixes, suffixes, letters) and its neighbors
• Examples:
  - w_i has the suffix -ied
  - w_i has the prefix ge-

Rule Application
• Two possibilities:
  - immediate consequences (left-to-right):
    • data: DT NN VBP NN VBP NN ...
    • rule: NN → NNS / preceded by NN VBP
    • apply the rule at position 4: DT NN VBP NN VBP NN ... → DT NN VBP NNS VBP NN ...
    • ... then the rule cannot apply at position 6 (the context is not NN VBP anymore).
  - delayed ("fixed input"):
    • use the original input for the context
    • the above rule then applies twice.
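The difference between the two application modes can be seen on the slide's own example (rule: NN → NNS if preceded by NN VBP). The function below is an illustrative sketch; the rule representation (a source/target tag plus a two-tag left context) is an assumption made for this example only.

```python
def apply_rule(tags, from_tag, to_tag, left_context, mode="immediate"):
    """Apply 'from_tag -> to_tag / preceded by left_context' to a tag sequence."""
    out = tags if mode == "immediate" else list(tags)
    context_source = out if mode == "immediate" else tags   # delayed = "fixed input"
    for i in range(2, len(out)):
        if out[i] == from_tag and \
           (context_source[i - 2], context_source[i - 1]) == left_context:
            out[i] = to_tag
    return out

data = ["DT", "NN", "VBP", "NN", "VBP", "NN"]
print(apply_rule(list(data), "NN", "NNS", ("NN", "VBP"), mode="immediate"))
# ['DT', 'NN', 'VBP', 'NNS', 'VBP', 'NN']   - the change at position 4 blocks position 6
print(apply_rule(list(data), "NN", "NNS", ("NN", "VBP"), mode="delayed"))
# ['DT', 'NN', 'VBP', 'NNS', 'VBP', 'NNS']  - the original context is used, the rule fires twice
```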
In Other Words
1. Strip the tags off the truth; keep the original truth.
2. Initialize the stripped data by some simple method.
3. Start with an empty set of selected rules, S.
4. Repeat until the stopping criterion applies:
  - compute the contribution of each rule r: contrib(r) = c_improved(r) - c_worsened(r)
  - select the r which has the biggest contribution contrib(r); add it to the final set of selected rules S.
5. Output the set S.

The Tagger
• Input:
  - untagged data
  - the rules (S) learned by the learner
• Tagging:
  - use the same initialization as the learner did
  - for i = 1..n (n - the number of rules learnt):
    • apply rule i to the whole intermediate data, changing (some) tags
  - the last intermediate data is the output.

N-best & Unsupervised Modifications
• N-best modification
  - allow rules to add tags
  - criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1!)
• Unsupervised modification
  - use only unambiguous words for the evaluation criterion
  - works extremely well for English
  - does not work for languages with few unambiguous words
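Putting the learner (steps 1-5 above) and the tagger side by side, a schematic version could look like the sketch below. Every helper passed in (initial_tags, candidate_rules, and an apply_rule that takes a whole rule object) is a hypothetical placeholder, not something defined in the slides.

```python
def tbl_learn(words, truth, initial_tags, candidate_rules, apply_rule, threshold=1):
    """Learn an ordered list of rules S from annotated data (words, truth)."""
    current = initial_tags(words)                    # step 2: initialize stripped data
    selected = []                                    # step 3: empty rule set S
    while True:                                      # step 4
        best_rule, best_contrib = None, 0
        for r in candidate_rules(words, current, truth):
            proposed = apply_rule(words, current, r)
            improved = sum(1 for a, b, t in zip(current, proposed, truth)
                           if a != t and b == t)     # c_improved(r)
            worsened = sum(1 for a, b, t in zip(current, proposed, truth)
                           if a == t and b != t)     # c_worsened(r)
            if improved - worsened > best_contrib:
                best_rule, best_contrib = r, improved - worsened
        if best_rule is None or best_contrib < threshold:
            break                                    # stopping criterion
        selected.append(best_rule)
        current = apply_rule(words, current, best_rule)
    return selected                                  # step 5: output S

def tbl_tag(words, rules, initial_tags, apply_rule):
    """Tag untagged data with the learned rules, in the order they were learnt."""
    tags = initial_tags(words)                       # same initialization as the learner
    for r in rules:
        tags = apply_rule(words, tags, r)            # apply rule i to all the data
    return tags
```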