HMM Parameter Estimation: the Baum-Welch Algorithm
PA154 Jazykové modelování (6.1)
Pavel Rychlý
pary@fi.muni.cz
April 6, 2021

HMM: The Tasks
HMM (the general case):
► a five-tuple (S, S_0, Y, P_S, P_Y), where:
► S = {s_1, s_2, ..., s_T} is the set of states, S_0 is the initial state,
► Y = {y_1, y_2, ..., y_V} is the output alphabet,
► P_S(s_j|s_i) is the set of probability distributions of transitions,
► P_Y(y_k|s_i, s_j) is the set of output (emission) probability distributions.
Given an HMM & an output sequence Y = {y_1, y_2, ..., y_k}:
► (Task 1) compute the probability of Y;
► (Task 2) compute the most likely sequence of states which has generated Y;
► (Task 3) estimate the parameters (transition/output distributions).
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

A Variant of Expectation-Maximization
Idea (~EM; for another variant see LM smoothing, lect. 3):
► Start with (possibly random) estimates of P_S and P_Y.
► Compute (fractional) "counts" of state transitions/emissions taken, from P_S and P_Y, given the data Y.
► Adjust the estimates of P_S and P_Y from these "counts" (using MLE, i.e. relative frequency as the estimate).
Remarks:
► many more parameters than the simple four-way smoothing
► no proofs here; see Jelinek, Chapter 9

Setting
HMM (without P_S, P_Y): (S, S_0, Y), and data T = {y_i ∈ Y}_{i=1..|T|}
► will use T ~ |T|
HMM structure is given: (S, S_0)
P_S: typically, one wants to allow a "fully connected" graph
► i.e. no transitions forbidden ~ no transitions set to hard 0
► why? better to leave it to the learning phase, based on the data!
► sometimes possible to remove some transitions ahead of time
P_Y: should be restricted (if not, we will not get anywhere!)
► restricted ~ hard 0 probabilities of p(y|s,s')
► "dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization
For computing the initial expected "counts".
Important part:
► EM is guaranteed to find a local maximum only (albeit a good one in most cases)
P_Y initialization is more important:
► fortunately, often easy to determine
► together with the dictionary ↔ vocabulary mapping, get counts, then MLE
P_S initialization is less important:
► e.g. uniform distribution for each p(·|s)

Data structures
Will need storage for:
► the predetermined structure of the HMM (unless fully connected; then there is no need to keep it!)
► the parameters to be estimated (P_S, P_Y)
► the expected counts (same size as (P_S, P_Y))
► the training data T = {y_i ∈ Y}_{i=1..|T|}
► the trellis (if fully connected): each trellis state holds two [float] numbers (forward/backward probability); size: T × S (precisely, |T| × |S|), and then some
[Trellis figure omitted]

The Algorithm Part I
1. Initialize P_S, P_Y.
2. Compute "forward" probabilities:
► follow the procedure for the trellis (summing), compute α(s, i) everywhere
► use the current values of P_S, P_Y (p(s'|s), p(y|s,s')):
  α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(y_i|s, s')
► NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
► start at all nodes of the last stage, proceed backwards, computing β(s, i)
► i.e. the probability of the "tail" of the data, from stage i to the end of the data:
  β(s', i) = Σ_{s'→s} β(s, i+1) × p(s|s') × p(y_{i+1}|s', s)
► also, keep the β(s, i) at all trellis states
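A minimal Python sketch of Steps 2-3 (the forward and backward passes), assuming transitions are stored as nested dicts p_s[s][s'] and emissions as p_y[(s, s')][y]; the function names and data layout are illustrative assumptions, not part of the slides. Scaling by N(i) (see the Tips & Tricks slide below) is omitted for brevity.

```python
def forward(Y, states, s0, p_s, p_y):
    """alpha[i][s]: probability of emitting y_1..y_i and ending in state s at stage i."""
    alpha = [{s0: 1.0}]                                        # stage 0: only the initial state
    for i, y in enumerate(Y, start=1):                         # y is y_i
        stage = {}
        for s, a_prev in alpha[i - 1].items():
            for s2 in states:
                p = a_prev * p_s.get(s, {}).get(s2, 0.0) \
                           * p_y.get((s, s2), {}).get(y, 0.0)  # alpha(s,i-1) * p(s'|s) * p(y_i|s,s')
                if p > 0.0:
                    stage[s2] = stage.get(s2, 0.0) + p
        alpha.append(stage)                                    # keep every stage, not just the last one
    return alpha


def backward(Y, states, p_s, p_y):
    """beta[i][s]: probability of the tail y_{i+1}..y_T given state s at stage i."""
    T = len(Y)
    beta = [dict() for _ in range(T + 1)]
    beta[T] = {s: 1.0 for s in states}                         # last stage: beta(s, T) = 1
    for i in range(T - 1, -1, -1):
        for s in states:
            beta[i][s] = sum(p_s.get(s, {}).get(s2, 0.0)            # p(s'|s)
                             * p_y.get((s, s2), {}).get(Y[i], 0.0)  # p(y_{i+1}|s,s'); Y[i] is y_{i+1}
                             * beta[i + 1].get(s2, 0.0)             # tail from stage i+1
                             for s2 in states)
    return beta
```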
The Algorithm Part II
4. Collect counts:
► for each output/transition pair, compute, in one pass through the data (stopping only at positions where the output is y):
  c(y, s, s') = Σ_{i=0..k-1, y_{i+1}=y} α(s, i) × p(s'|s) × p(y_{i+1}|s, s') × β(s', i+1)
  (α(s, i): prefix probability; p(s'|s) × p(y_{i+1}|s, s'): this transition's probability × the output probability; β(s', i+1): tail probability)
► c(s, s') = Σ_y c(y, s, s')   (assuming all observed y_i are in Y)
► c(s) = Σ_{s'∈S} c(s, s')
5. Reestimate:
► p'(s'|s) = c(s, s') / c(s)
► p'(y|s, s') = c(y, s, s') / c(s, s')
6. Repeat steps 2-5 until the desired convergence limit is reached.

Baum-Welch: Tips & Tricks
Normalization is badly needed:
► long training data → extremely small probabilities
Normalize α and β using the same normalization factor N(i) = Σ_{s∈S} α(s, i), as follows:
► compute α(s, i) as usual (Step 2 of the algorithm), accumulating the sum N(i) at the given stage i as you go;
► at the end of each stage, recompute all alphas (for each state s): α*(s, i) = α(s, i) / N(i)
► use the same N(i) for the βs at the end of each backward (Step 3) stage: β*(s, i) = β(s, i) / N(i)

Example
Task: pronunciation of "the".
Solution: build an HMM, fully connected, with 4 states:
► S - short article, L - long article, C, V - word starting with a consonant / a vowel
► thus, only "the" is ambiguous (a, an, the are not members of C, V)
► output from states only (p(w|s, s') = p(w|s'))
Data: an egg and a piece of the big .... the end
Trellis: [figure omitted]

Example: Initialization
Output probabilities:
► p_init(w|c) = c(c, w) / c(c); where c(S, the) = c(L, the) = c(the)/2 (other than that, everything is deterministic)
Transition probabilities:
► p_init(c'|c) = 1/4 (uniform)
Don't forget:
► about the space needed
► initialize α(X, 0) = 1 (X: the never-occurring front buffer state)
► initialize β(s, T) = 1 for all s (except for s = X)

Fill in alpha, beta
Left to right, alpha:
► α(s', i) = Σ_{s→s'} α(s, i-1) × p(s'|s) × p(w_i|s'), where w_i is the output emitted by state s' (output from states only)
► remember the normalization factor N(i)
Similarly, beta (on the way back from the end).
Data: an egg and a piece of the big .... the end
[Trellis figure omitted; arcs annotated with terms such as p(S|V) × p(the|S)]
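Continuing the same hypothetical data layout, a Python sketch of one re-estimation pass (Steps 4-5), reusing the forward()/backward() helpers sketched above. The N(i) scaling from the Tips & Tricks slide is again omitted; the common 1/P(Y) factor cancels in the relative-frequency estimates, so the raw sums can be used directly.

```python
from collections import defaultdict

def reestimate(Y, states, s0, p_s, p_y):
    """One Baum-Welch iteration: collect expected counts, then re-estimate by MLE."""
    alpha = forward(Y, states, s0, p_s, p_y)
    beta = backward(Y, states, p_s, p_y)
    c_yss, c_ss, c_s = defaultdict(float), defaultdict(float), defaultdict(float)
    for i, y in enumerate(Y):                                 # one pass through the data; y is y_{i+1}
        for s, a in alpha[i].items():                         # prefix probability alpha(s, i)
            for s2 in states:
                inc = (a
                       * p_s.get(s, {}).get(s2, 0.0)          # transition probability p(s'|s)
                       * p_y.get((s, s2), {}).get(y, 0.0)     # output probability p(y_{i+1}|s,s')
                       * beta[i + 1].get(s2, 0.0))            # tail probability beta(s', i+1)
                if inc > 0.0:
                    c_yss[(y, s, s2)] += inc                  # c(y, s, s')
                    c_ss[(s, s2)] += inc                      # c(s, s')
                    c_s[s] += inc                             # c(s)
    new_p_s, new_p_y = defaultdict(dict), defaultdict(dict)
    for (s, s2), c in c_ss.items():
        new_p_s[s][s2] = c / c_s[s]                           # p'(s'|s) = c(s,s') / c(s)
    for (y, s, s2), c in c_yss.items():
        new_p_y[(s, s2)][y] = c / c_ss[(s, s2)]               # p'(y|s,s') = c(y,s,s') / c(s,s')
    return dict(new_p_s), dict(new_p_y)                       # repeat until convergence (Step 6)
```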