HMM Parameter Estimation: the Baum-Welch algorithm
PA154 Language Modeling (6.1)
Pavel Rychlý
pary@fi.muni.cz
March 26, 2024
Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

HMM: The Tasks
■ HMM (the general case):
■ five-tuple (S, S₀, Y, Ps, Py), where:
■ S = {s₁, s₂, ..., s_T} is the set of states, S₀ is the initial state,
■ Y = {y₁, y₂, ..., y_V} is the output alphabet,
■ Ps(s_j|s_i) is the set of prob. distributions of transitions,
■ Py(y_k|s_i, s_j) is the set of output (emission) probability distributions.
■ Given an HMM & an output sequence Y = {y₁, y₂, ..., y_k}:
■ (Task 1) compute the probability of Y;
■ (Task 2) compute the most likely sequence of states which has generated Y;
■ (Task 3) estimate the parameters (transition/output distributions).

A variant of Expectation-Maximization
■ Idea (~EM, for another variant see LM smoothing (Lecture 3.2)):
■ Start with (possibly random) estimates of Ps and Py.
■ Compute (fractional) "counts" of state transitions/emissions taken, from Ps and Py, given data Y.
■ Adjust the estimates of Ps and Py from these "counts" (using MLE, i.e. relative frequency as the estimate).
■ Remarks:
■ many more parameters than the simple four-way smoothing
■ no proofs here; see Jelinek, Chapter 9

Setting
■ HMM (without Ps, Py): (S, S₀, Y), and data T = {y_i ∈ Y}, i = 1..|T|
■ will use T ~ |T|
■ HMM structure is given: (S, S₀)
■ Ps: typically, one wants to allow a "fully connected" graph
■ (i.e. no transitions forbidden ~ no transitions set to hard 0)
■ why? → better leave it to the learning phase, based on the data!
■ sometimes possible to remove some transitions ahead of time
■ Py: should be restricted (if not, we will not get anywhere!)
■ restricted ~ hard 0 probabilities of p(y|s,s')
■ "Dictionary": states ↔ words, "m:n" mapping on S × Y (in general)

Initialization
■ For computing the initial expected "counts"
■ Important part
■ EM guaranteed to find a local maximum only (albeit a good one in most cases)
■ Py initialization more important
■ fortunately, often easy to determine
■ together with the dictionary ↔ vocabulary mapping, get counts, then MLE
■ Ps initialization less important
■ e.g. uniform distribution for each p(·|s)

Data structures
Will need storage for:
■ The predetermined structure of the HMM (unless fully connected → need not keep it!)
■ The parameters to be estimated (Ps, Py)
■ The expected counts (same size as (Ps, Py))
■ The training data T = {y_i ∈ Y}, i = 1..T
■ The trellis (if fully connected): size T × S (precisely, |T| × |S|); each trellis state holds two [float] numbers (forward/backward) and then some
(trellis diagram omitted)

The Algorithm Part I
1. Initialize Ps, Py
2. Compute "forward" probabilities:
■ follow the procedure for trellis (summing), compute α(s, i) everywhere
■ use the current values of Ps, Py (p(s'|s), p(y|s,s')):
  α(s', i) = Σ_{s→s'} α(s, i−1) × p(s'|s) × p(y_i|s,s')
■ NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
■ start at all nodes of the last stage, proceed backwards, β(s, i)
■ i.e., probability of the "tail" of data from stage i to the end of data:
  β(s, i) = Σ_{s'←s} β(s', i+1) × p(s'|s) × p(y_{i+1}|s,s')
■ also, keep the β(s, i) at all trellis states
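To make Steps 2 and 3 concrete, here is a minimal Python sketch of the two trellis passes. It is an illustration under assumptions, not the lecture's code: the nested-dict layout p_trans[s][s2] ~ p(s2|s) and p_emit[s2][y] ~ p(y|s2) is hypothetical, output is conditioned on the target state only (p(y|s,s') = p(y|s'), as in the example later), the graph is assumed fully connected, and the per-stage normalization from Tips & Tricks is omitted.

```python
# Sketch of the trellis passes (Steps 2-3); dict layout is an assumption.
# p_trans[s][s2] ~ p(s2|s), p_emit[s2][y] ~ p(y|s2)

def forward(init_state, p_trans, p_emit, data):
    """alpha[i][s] = P(y_1..y_i, state s at stage i); keeps every stage."""
    alpha = [{init_state: 1.0}]            # stage 0: the front buffer state X
    for y in data:
        prev, cur = alpha[-1], {}
        for s, a in prev.items():
            for s2, pt in p_trans[s].items():
                pe = p_emit[s2].get(y, 0.0)
                if pe > 0.0:
                    cur[s2] = cur.get(s2, 0.0) + a * pt * pe
        alpha.append(cur)                  # NB: previous stages are kept
    return alpha

def backward(states, p_trans, p_emit, data):
    """beta[i][s] = P(y_{i+1}..y_T | state s at stage i)."""
    T = len(data)
    beta = [{} for _ in range(T)] + [{s: 1.0 for s in states}]  # beta(s,T) = 1
    for i in range(T - 1, -1, -1):
        y = data[i]                        # the output emitted at stage i+1
        for s in states:
            tot = sum(pt * p_emit[s2].get(y, 0.0) * beta[i + 1].get(s2, 0.0)
                      for s2, pt in p_trans[s].items())
            if tot > 0.0:
                beta[i][s] = tot
    return beta
```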
The Algorithm Part II
4. Collect counts:
■ for each output/transition pair compute
  c(y, s, s') = Σ_{i=0..k−1, y=y_{i+1}} α(s, i) × p(s'|s) × p(y_{i+1}|s,s') × β(s', i+1)
  (assuming all observed y_i are in Y)
■ c(s, s') = Σ_{y∈Y} c(y, s, s') (transition "counts")
■ c(s) = Σ_{s'∈S} c(s, s') (state "counts")
5. Reestimate:
  p'(s'|s) = c(s, s')/c(s)
  p'(y|s,s') = c(y, s, s')/c(s, s')
6. Repeat 2-5 until the desired convergence limit is reached

Baum-Welch: Tips & Tricks
■ Normalization badly needed
■ long training data → extremely small probabilities
■ Normalize α, β using the same normalization factor N(i), as follows:
■ compute α(s, i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go
■ at the end of each stage, recompute all α's (for each state s): α*(s, i) = α(s, i)/N(i)
■ use the same N(i) for the β's at the end of each backward (Step 3) stage: β*(s, i) = β(s, i)/N(i)

Example
■ Task: pronunciation of "the"
■ Solution: build an HMM, fully connected, 4 states:
■ S - short article, L - long article, C, V - word starting w/ consonant, vowel
■ thus, only "the" is ambiguous (a, an, the - not members of C, V)
■ output from states only (p(w|s,s') = p(w|s'))
■ Data Y: an egg and a piece of the big .... the end
(trellis diagram omitted)

Example: Initialization
■ Output probabilities:
■ P_init(w|c) = c(c, w)/c(c); where c(S, the) = c(L, the) = c(the)/2 (other than that, everything is deterministic)
■ Transition probabilities:
■ P_init(c'|c) = 1/4 (uniform)
■ Don't forget:
■ about the space needed
■ initialize α(X, 0) = 1 (X: the never-occurring front buffer state)
■ initialize β(s, T) = 1 for all s (except for s = X)

Fill in alpha, beta
■ Left to right, alpha:
  α(s', i) = Σ_{s→s'} α(s, i−1) × p(s'|s) × p(w_i|s'), where w_i is the output at stage i
■ Remember normalization (N(i)).
■ Similarly, beta (on the way back from the end).
■ E.g. (with "of" at position 6, "the" at 7, "big" at 8):
  β(V, 6) = β(L, 7) p(L|V) p(the|L) + β(S, 7) p(S|V) p(the|S)
  α(C, 8) = α(L, 7) p(C|L) p(big|C) + α(S, 7) p(C|S) p(big|C)

Counts & Reestimation
■ One pass through data
■ At each position i, go through all pairs (s_i, s_{i+1})
■ Increment the appropriate counters by fractional counts (Step 4):
■ inc(y_{i+1}, s_i, s_{i+1}) = α(s_i, i) p(s_{i+1}|s_i) p(y_{i+1}|s_{i+1}) β(s_{i+1}, i+1)
■ c(y, s_i, s_{i+1}) += inc (for y at position i+1)
■ c(s_i, s_{i+1}) += inc (always)
■ c(s_i) += inc (always)
■ E.g.:
  inc(big, L, C) = α(L, 7) p(C|L) p(big|C) β(C, 8)
  inc(big, S, C) = α(S, 7) p(C|S) p(big|C) β(C, 8)
■ Reestimate p(s'|s), p(y|s)
■ and hope for an increase in p(C|S) and p(V|L) ...!!
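Continuing the sketch above under the same assumptions, Steps 4-5 in Python. Since output is conditioned on the target state only, emission counts are collected as c(y, s') instead of c(y, s, s'); normalization by N(i) is again omitted for clarity.

```python
from collections import defaultdict

def reestimate(alpha, beta, p_trans, p_emit, data):
    """One pass collecting fractional counts (Step 4), then MLE ratios (Step 5).

    alpha/beta are the stage-indexed dicts returned by forward()/backward().
    """
    c_ss = defaultdict(float)     # c(s, s')  transition "counts"
    c_s = defaultdict(float)      # c(s)      state "counts"
    c_ye = defaultdict(float)     # c(y, s')  emission "counts" (output
    c_in = defaultdict(float)     #           from the target state only)
    for i, y in enumerate(data):  # y is the output emitted at stage i+1
        for s, a in alpha[i].items():
            for s2, pt in p_trans[s].items():
                inc = a * pt * p_emit[s2].get(y, 0.0) * beta[i + 1].get(s2, 0.0)
                if inc > 0.0:
                    c_ss[(s, s2)] += inc
                    c_s[s] += inc
                    c_ye[(y, s2)] += inc
                    c_in[s2] += inc
    new_trans, new_emit = defaultdict(dict), defaultdict(dict)
    for (s, s2), c in c_ss.items():
        new_trans[s][s2] = c / c_s[s]         # p'(s'|s) = c(s,s')/c(s)
    for (y, s2), c in c_ye.items():
        new_emit[s2][y] = c / c_in[s2]        # p'(y|s') = c(y,s')/c(s')
    return new_trans, new_emit
```

Step 6 then amounts to repeating forward(), backward(), reestimate() until the data likelihood stops improving noticeably.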
HMM: Final Remarks
■ Parameter "tying":
■ keep certain parameters the same (~ just one "counter" for all of them)
■ any combination in principle possible
■ ex.: smoothing (just one set of lambdas)
■ Real numbers output:
■ Y of infinite size (ℝ, ℝⁿ)
■ parametric (typically: few parameters) distribution needed (e.g., "Gaussian")
■ "Empty" transitions: do not generate output
■ ~ vertical arcs in the trellis; do not use in "counting"
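As a concrete reading of the "real numbers output" remark, a hypothetical sketch: the finite emission table is replaced by a per-state Gaussian density with parameters (μ, σ), and the reestimation step becomes a mean/variance update weighted by the same fractional counts from Step 4.

```python
import math

def gaussian_emit(y, mu, sigma):
    """Parametric density replacing the finite emission table when Y is R."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def reestimate_gaussian(frac_counts, outputs):
    """Weighted MLE for one state's (mu, sigma): frac_counts[i] is the
    fractional count of this state emitting outputs[i] (from Step 4)."""
    total = sum(frac_counts)
    mu = sum(f * y for f, y in zip(frac_counts, outputs)) / total
    var = sum(f * (y - mu) ** 2 for f, y in zip(frac_counts, outputs)) / total
    return mu, math.sqrt(var)
```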