HMM Parameter Estimation: the Baum-Welch algorithm
PA154 Language Modeling (6.1)
Pavel Rychlý
pary@fi.muni.cz
March 26, 2024
Source: Introduction to Natural Language Processing (600.465)
Jan Hajič, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

HMM: The Tasks
■ HMM (the general case):
■ five-tuple (S, S₀, Y, Ps, Py), where:
■ S = {s₁, s₂, ..., s_T} is the set of states, S₀ is the initial state,
■ Y = {y₁, y₂, ..., y_V} is the output alphabet,
■ Ps(s_j|s_i) is the set of prob. distributions of transitions,
■ Py(y_k|s_i, s_j) is the set of output (emission) probability distributions.
■ Given an HMM & an output sequence Y = {y₁, y₂, ..., y_k}:
■ (Task 1) compute the probability of Y;
■ (Task 2) compute the most likely sequence of states which has generated Y;
■ (Task 3) estimate the parameters (transition/output distributions).

A variant of Expectation-Maximization
■ Idea (~EM, for another variant see LM smoothing (Lecture 3.2)):
■ Start with (possibly random) estimates of Ps and Py.
■ Compute (fractional) "counts" of state transitions/emissions taken, from Ps and Py, given data Y.
■ Adjust the estimates of Ps and Py from these "counts" (using MLE, i.e. relative frequency as the estimate).
■ Remarks:
■ many more parameters than the simple four-way smoothing
■ no proofs here; see Jelinek, Chapter 9

Setting
■ HMM (without Ps, Py): (S, S₀, Y), and data T = {y_i ∈ Y}, i = 1..|T|
■ will use T ~ |T|
■ HMM structure is given: (S, S₀)
■ Ps: typically, one wants to allow a "fully connected" graph
■ (i.e. no transitions forbidden ~ no transitions set to hard 0)
■ why? → better leave it to the learning phase, based on the data!
■ sometimes possible to remove some transitions ahead of time
■ Py: should be restricted (if not, we will not get anywhere!)
■ restricted ~ hard 0 probabilities of p(y|s,s')
■ "Dictionary": states ↔ words, "m:n" mapping on S × Y (in general)

Initialization
■ For computing the initial expected "counts"
■ Important part
■ EM guaranteed to find a local maximum only (albeit a good one in most cases)
■ Py initialization more important
■ fortunately, often easy to determine
■ together with the dictionary ↔ vocabulary mapping, get counts, then MLE
■ Ps initialization less important
■ e.g. uniform distribution for each p(·|s)

Data structures
Will need storage for:
■ The predetermined structure of the HMM (unless fully connected → need not keep it!)
■ The parameters to be estimated (Ps, Py)
■ The expected counts (same size as (Ps, Py))
■ The training data T = {y_i ∈ Y}, i = 1..T
■ The trellis (if fully connected): size T × S (precisely, |T| × |S|); each trellis state holds two [float] numbers (forward/backward) and then some
(trellis diagram omitted)

The Algorithm Part I
1. Initialize Ps, Py
2. Compute "forward" probabilities:
■ follow the procedure for trellis (summing), compute α(s, i) everywhere
■ use the current values of Ps, Py (p(s'|s), p(y|s,s')):
  α(s', i) = Σ_{s→s'} α(s, i−1) × p(s'|s) × p(y_i|s,s')
■ NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
■ start at all nodes of the last stage, proceed backwards, β(s, i)
■ i.e., probability of the "tail" of data from stage i to the end of data:
  β(s, i) = Σ_{s'←s} β(s', i+1) × p(s'|s) × p(y_{i+1}|s,s')
■ also, keep the β(s, i) at all trellis states
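To make Steps 2 and 3 concrete, here is a minimal Python sketch of the two trellis passes. It is an illustration under assumptions, not the lecture's code: the nested-dict layout p_trans[s][s2] ~ p(s2|s) and p_emit[s2][y] ~ p(y|s2) is hypothetical, output is conditioned on the target state only (p(y|s,s') = p(y|s'), as in the example later), the graph is assumed fully connected, and the per-stage normalization from Tips & Tricks is omitted.

```python
# Sketch of the trellis passes (Steps 2-3); dict layout is an assumption.
# p_trans[s][s2] ~ p(s2|s), p_emit[s2][y] ~ p(y|s2)

def forward(init_state, p_trans, p_emit, data):
    """alpha[i][s] = P(y_1..y_i, state s at stage i); keeps every stage."""
    alpha = [{init_state: 1.0}]            # stage 0: the front buffer state X
    for y in data:
        prev, cur = alpha[-1], {}
        for s, a in prev.items():
            for s2, pt in p_trans[s].items():
                pe = p_emit[s2].get(y, 0.0)
                if pe > 0.0:
                    cur[s2] = cur.get(s2, 0.0) + a * pt * pe
        alpha.append(cur)                  # NB: previous stages are kept
    return alpha

def backward(states, p_trans, p_emit, data):
    """beta[i][s] = P(y_{i+1}..y_T | state s at stage i)."""
    T = len(data)
    beta = [{} for _ in range(T)] + [{s: 1.0 for s in states}]  # beta(s,T) = 1
    for i in range(T - 1, -1, -1):
        y = data[i]                        # the output emitted at stage i+1
        for s in states:
            tot = sum(pt * p_emit[s2].get(y, 0.0) * beta[i + 1].get(s2, 0.0)
                      for s2, pt in p_trans[s].items())
            if tot > 0.0:
                beta[i][s] = tot
    return beta
```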
The Algorithm Part II
4. Collect counts:
■ for each output/transition pair compute
  c(y, s, s') = Σ_{i=0..k−1, y=y_{i+1}} α(s, i) × p(s'|s) × p(y_{i+1}|s,s') × β(s', i+1)
  (assuming all observed y_i are in Y)
■ c(s, s') = Σ_{y∈Y} c(y, s, s') (transition "counts")
■ c(s) = Σ_{s'∈S} c(s, s') (state "counts")
5. Reestimate:
  p'(s'|s) = c(s, s')/c(s)
  p'(y|s,s') = c(y, s, s')/c(s, s')
6. Repeat 2-5 until the desired convergence limit is reached

Baum-Welch: Tips & Tricks
■ Normalization badly needed
■ long training data → extremely small probabilities
■ Normalize α, β using the same normalization factor N(i), as follows:
■ compute α(s, i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go
■ at the end of each stage, recompute all α's (for each state s): α*(s, i) = α(s, i)/N(i)
■ use the same N(i) for the β's at the end of each backward (Step 3) stage: β*(s, i) = β(s, i)/N(i)

Example
■ Task: pronunciation of "the"
■ Solution: build an HMM, fully connected, 4 states:
■ S - short article, L - long article, C, V - word starting w/ consonant, vowel
■ thus, only "the" is ambiguous (a, an, the - not members of C, V)
■ output from states only (p(w|s,s') = p(w|s'))
■ Data Y: an egg and a piece of the big .... the end
(trellis diagram omitted)

Example: Initialization
■ Output probabilities:
■ P_init(w|c) = c(c, w)/c(c); where c(S, the) = c(L, the) = c(the)/2 (other than that, everything is deterministic)
■ Transition probabilities:
■ P_init(c'|c) = 1/4 (uniform)
■ Don't forget:
■ about the space needed
■ initialize α(X, 0) = 1 (X: the never-occurring front buffer state)
■ initialize β(s, T) = 1 for all s (except for s = X)

Fill in alpha, beta
■ Left to right, alpha:
  α(s', i) = Σ_{s→s'} α(s, i−1) × p(s'|s) × p(w_i|s'), where w_i is the output at stage i
■ Remember normalization (N(i)).
■ Similarly, beta (on the way back from the end).
■ E.g. (with "of" at position 6, "the" at 7, "big" at 8):
  β(V, 6) = β(L, 7) p(L|V) p(the|L) + β(S, 7) p(S|V) p(the|S)
  α(C, 8) = α(L, 7) p(C|L) p(big|C) + α(S, 7) p(C|S) p(big|C)

Counts & Reestimation
■ One pass through data
■ At each position i, go through all pairs (s_i, s_{i+1})
■ Increment the appropriate counters by fractional counts (Step 4):
■ inc(y_{i+1}, s_i, s_{i+1}) = α(s_i, i) p(s_{i+1}|s_i) p(y_{i+1}|s_{i+1}) β(s_{i+1}, i+1)
■ c(y, s_i, s_{i+1}) += inc (for y at position i+1)
■ c(s_i, s_{i+1}) += inc (always)
■ c(s_i) += inc (always)
■ E.g.:
  inc(big, L, C) = α(L, 7) p(C|L) p(big|C) β(C, 8)
  inc(big, S, C) = α(S, 7) p(C|S) p(big|C) β(C, 8)
■ Reestimate p(s'|s), p(y|s)
■ and hope for an increase in p(C|S) and p(V|L) ...!!
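Continuing the sketch above under the same assumptions, Steps 4-5 in Python. Since output is conditioned on the target state only, emission counts are collected as c(y, s') instead of c(y, s, s'); normalization by N(i) is again omitted for clarity.

```python
from collections import defaultdict

def reestimate(alpha, beta, p_trans, p_emit, data):
    """One pass collecting fractional counts (Step 4), then MLE ratios (Step 5).

    alpha/beta are the stage-indexed dicts returned by forward()/backward().
    """
    c_ss = defaultdict(float)     # c(s, s')  transition "counts"
    c_s = defaultdict(float)      # c(s)      state "counts"
    c_ye = defaultdict(float)     # c(y, s')  emission "counts" (output
    c_in = defaultdict(float)     #           from the target state only)
    for i, y in enumerate(data):  # y is the output emitted at stage i+1
        for s, a in alpha[i].items():
            for s2, pt in p_trans[s].items():
                inc = a * pt * p_emit[s2].get(y, 0.0) * beta[i + 1].get(s2, 0.0)
                if inc > 0.0:
                    c_ss[(s, s2)] += inc
                    c_s[s] += inc
                    c_ye[(y, s2)] += inc
                    c_in[s2] += inc
    new_trans, new_emit = defaultdict(dict), defaultdict(dict)
    for (s, s2), c in c_ss.items():
        new_trans[s][s2] = c / c_s[s]         # p'(s'|s) = c(s,s')/c(s)
    for (y, s2), c in c_ye.items():
        new_emit[s2][y] = c / c_in[s2]        # p'(y|s') = c(y,s')/c(s')
    return new_trans, new_emit
```

Step 6 then amounts to repeating forward(), backward(), reestimate() until the data likelihood stops improving noticeably.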
HMM: Final Remarks
■ Parameter "tying":
■ keep certain parameters the same (~ just one "counter" for all of them)
■ any combination in principle possible
■ ex.: smoothing (just one set of lambdas)
■ Real numbers output:
■ Y of infinite size (ℝ, ℝⁿ)
■ parametric (typically: few parameters) distribution needed (e.g., "Gaussian")
■ "Empty" transitions: do not generate output
■ ~ vertical arcs in the trellis; do not use in "counting"
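As a concrete reading of the "real numbers output" remark, a hypothetical sketch: the finite emission table is replaced by a per-state Gaussian density with parameters (μ, σ), and the reestimation step becomes a mean/variance update weighted by the same fractional counts from Step 4.

```python
import math

def gaussian_emit(y, mu, sigma):
    """Parametric density replacing the finite emission table when Y is R."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def reestimate_gaussian(frac_counts, outputs):
    """Weighted MLE for one state's (mu, sigma): frac_counts[i] is the
    fractional count of this state emitting outputs[i] (from Step 4)."""
    total = sum(frac_counts)
    mu = sum(f * y for f, y in zip(frac_counts, outputs)) / total
    var = sum(f * (y - mu) ** 2 for f, y in zip(frac_counts, outputs)) / total
    return mu, math.sqrt(var)
```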