HMM Parameter Estimation: the Baum-Welch algorithm
PA154 Jazykové modelování (6.1)
Pavel Rychlý
pary@fi.muni.cz
April 6, 2021
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

HMM: The Tasks
■ HMM (the general case): a five-tuple (S, S₀, Y, P_S, P_Y), where:
► S = {s1, s2, ..., sT} is the set of states, S₀ is the initial state,
► Y = {y1, y2, ..., yV} is the output alphabet,
► P_S(sj|si) is the set of probability distributions of transitions,
► P_Y(yk|si, sj) is the set of output (emission) probability distributions.
■ Given an HMM & an output sequence Y = {y1, y2, ..., yk}:
► (Task 1) compute the probability of Y;
► (Task 2) compute the most likely sequence of states which has generated Y;
► (Task 3) estimate the parameters (transition/output distributions).

A variant of Expectation-Maximization
■ Idea (~EM; for another variant, see LM smoothing (lect. 3)):
► Start with (possibly random) estimates of P_S and P_Y.
► Compute (fractional) "counts" of state transitions/emissions taken, from P_S and P_Y, given the data Y.
► Adjust the estimates of P_S and P_Y from these "counts" (using MLE, i.e. relative frequency as the estimate).
■ Remarks:
► many more parameters than the simple four-way smoothing
► no proofs here; see Jelinek, Chapter 9

Setting
■ HMM (without P_S, P_Y): (S, S₀, Y), and data T = {yi ∈ Y}, i = 1..|T|
► we will write T for |T|
■ HMM structure is given: (S, S₀)
■ P_S: typically, one wants to allow a "fully connected" graph
► (i.e. no transitions forbidden ≈ no transitions set to hard 0)
► why? → better to leave it to the learning phase, based on the data!
► sometimes possible to remove some transitions ahead of time
■ P_Y: should be restricted (if not, we will not get anywhere!)
► restricted ≈ hard 0 probabilities of p(y|s,s′)
► "Dictionary": states ↔ words, "m:n" mapping on S × Y (in general)

Initialization
■ For computing the initial expected "counts"
■ Important part
► EM is guaranteed to find only a local maximum (albeit a good one in most cases)
■ P_Y initialization more important
► fortunately, often easy to determine
► together with the dictionary (vocabulary mapping), get counts, then MLE
■ P_S initialization less important
► e.g. uniform distribution for each p(·|s)

Data structures
■ Will need storage for:
► the predetermined structure of the HMM (unless fully connected → no need to keep it!)
► the parameters to be estimated (P_S, P_Y)
► the expected counts (same size as (P_S, P_Y))
► the training data T = {yi}
► the trellis [figure: trellis of states over the stages of the data] (and then some)

The Algorithm Part I
1. Initialize P_S, P_Y.
2. Compute "forward" probabilities:
► follow the procedure for the trellis (summing), computing α(s, i) everywhere
► use the current values of P_S, P_Y (p(s′|s), p(y|s,s′)):
α(s′, i) = Σ_{s→s′} α(s, i−1) · p(s′|s) · p(yi|s, s′)
► NB: do not throw away the previous stage!
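To make Step 2 concrete, here is a minimal Python sketch of the forward pass over the trellis, assuming the arc-emission model of these slides. The names (forward, P_S, P_Y) and the dict-of-dicts data layout are illustrative assumptions, not from any library.

```python
from collections import defaultdict

def forward(states, s0, data, P_S, P_Y):
    """alpha[i][s] = probability of emitting y_1..y_i and ending in state s.

    Assumed layout: P_S = {s: {s2: p(s2|s)}},
                    P_Y = {(s, s2): {y: p(y|s, s2)}}.
    """
    alpha = [defaultdict(float) for _ in range(len(data) + 1)]
    alpha[0][s0] = 1.0                       # alpha(S_0, 0) = 1
    for i in range(1, len(data) + 1):
        y = data[i - 1]                      # y_i in the slides' numbering
        for s, a in alpha[i - 1].items():    # states reachable at stage i-1
            for s2 in states:                # every arc s -> s2
                p = P_S.get(s, {}).get(s2, 0.0) * P_Y.get((s, s2), {}).get(y, 0.0)
                if p > 0.0:
                    alpha[i][s2] += a * p    # sum over incoming arcs
    return alpha                             # keep every stage (the NB above)
```

Summing the last stage, Σ_s α(s, k), gives P(Y), i.e. Task 1 from the first slide; the backward pass, counts, and reestimation are sketched at the end of this section.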
& Compute " backward" probabilities ► start at all nodes of the last stage, proceed backwards, /3(s, /) ► i.e., probability of the "tail" of data from stage i to the end of data P(s',i) = Es'<-s/3(s>/ +x) x p(sIs/) x P(yi+i\s'is) ► also, keep the /3(s, /) at all trellis states PA154 Jazykové modelování (6.1) HMM Parameter Estimation: the Baum-Welch algorithm7/14 The Algorithm Part II Collect counts: ► for each output/transition pair compute c(y,s,s') = S1=0 k.ly=55+ia(s,i) p(s'|s) p(yj+1|s,s') p(s\i+l) > ZJ~—í prefix prob. . . , .. . tail prob one pass through data, / this transition prob only stop at (output) y x output prob c(s, s') = J2yeY c(y, s, s') (assuming all observed y-, in Y) cls) = Es'esc(s>s') Reestimate: pf(sf\s) — c{s,s')/c{s) p'{y\s,s') — c(y,s,s,)/c(s,s/) Repeat 2-5 until desired convergence limit is reached PA154 Jazykové modelování (6.1) HMM Parameter Estimation: the Baum-Welch algorithm8/14 Baum-Welch: Tips & Tricks ■ Normalization badly needed ► long training data —>* extremely small probabilities ■ Normalize a,/3 using the same norm.factor: as follows: ► compute a(s, /) as usual (Step 2 of the algorithm), computing the sum A/(/) at the given stage / as you go. ► at the end of each stage, recompute all a/p/?as(for each state s): a*(s,i) = a(s,i)/N(i) ► use the same A/(/) for f3s at the end of each backward (Step 3) stage: P*(s,i) = /3(s,i)/N(i) PA154 Jazykové modelování (6.1) HMM Parameter Estimation: the Baum-Welch algorithm9/14 Example ► S - short article, L - long article, C,V - word starting w/consonant, vowel thus, only "the" is ambiguous (a, an, the - not members of C,V) ■ Task: pronunciation of "the" ■ Solution: build HMM, fully connected, 4 states: ► S - short article, L - long article, C,V - word starting w/consona vowel ► thus, only "the" is ambiguous (a, an, the - not members of C,V ■ Output form states only (p(w|s, s;) = p(w|s/)) • DataY: an egg and a piece of the big .... die end Trellis: © © O ....... O © © O O G, PA154 Jazykové modelování (6.1) HMM Parameter Estimation: the Baum-Welch algorithm 10/14 Example: Initialization ■ Output probabilities: ► Pinit(w\c) = c(c, w)/c(c); where c(S, the) = c(/_, the) = c(the)/2 (other than that, everything is deterministic) ■ Transition probabilities: ► Pinit(c'\c) = l/4(uniform) ■ Don't forget: ► about the space needed ► initialize a(X, 0) = 1 (X : the never-occuring front buffer st.) ► initialize /3(s, 7") = 1 for all s (except for s = X) PA154 Jazykové modelování (6.1) HMM Parameter Estimation: the Baum-Welch algorithm 11/14 Fill in alpha, beta ■ Left to right, alpha: o(s', /) = X^s-s>s' a(s'' — 1) x p(s'ls) x p(wi\s')< where s' is the output from states ■ Remember normalization (N(i)). ■ Similary, beta (on the way back from the end). an egg and a piece of the big .... the end o ©.i q, ß(V, 6) = ß(L,7)p