Introduction to Natural Language Processing (600.465)
HMM Parameter Estimation: the Baum-Welch Algorithm

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

HMM: The Tasks
• HMM (the general case):
  - a five-tuple (S, S0, Y, PS, PY), where:
    * S = {s1, s2, ..., sT} is the set of states, S0 is the initial state,
    * Y = {y1, y2, ..., yV} is the output alphabet,
    * PS(sj|si) is the set of probability distributions of transitions,
    * PY(yk|si,sj) is the set of output (emission) probability distributions.
• Given an HMM & an output sequence Y = {y1, y2, ..., yk}:
  - (Task 1) compute the probability of Y;
  - (Task 2) compute the most likely sequence of states which has generated Y;
  - (Task 3) estimate the parameters (transition/output distributions).

A Variant of EM
• Idea (~ EM; for another variant see LM smoothing):
  - Start with (possibly random) estimates of PS and PY.
  - Compute (fractional) "counts" of the state transitions/emissions taken, from PS and PY, given the data Y.
  - Adjust the estimates of PS and PY from these "counts" (using MLE, i.e. relative frequency as the estimate).
• Remarks:
  - many more parameters than the simple four-way smoothing
  - no proofs here; see Jelinek, Chapter 9

Setting
• HMM (without PS, PY): (S, S0, Y), and data T = {yi ∈ Y}i=1..|T|
  - will use T ~ |T|
• HMM structure is given: (S, S0)
• PS: typically, one wants to allow a "fully connected" graph
  - (i.e. no transitions forbidden ~ no transitions set to hard 0)
  - why? → better to leave it to the learning phase, based on the data!
  - sometimes possible to remove some transitions ahead of time
• PY: should be restricted (if not, we will not get anywhere!)
  - restricted ~ hard 0 probabilities of p(y|s,s')
  - "Dictionary": states ↔ words, an "m:n" mapping on S × Y (in general)

Initialization
• For computing the initial expected "counts"
• Important part
  - EM is guaranteed to find a local maximum only (albeit a good one in most cases)
• PY initialization more important
  - fortunately, often easy to determine
  - together with the dictionary ↔ vocabulary mapping, get counts, then MLE
• PS initialization less important
  - e.g. uniform distribution for each p(.|s)

Data Structures
• Will need storage for:
  - the predetermined structure of the HMM (unless fully connected → need not keep it!)
  - the parameters to be estimated (PS, PY)
  - the expected counts (same size as PS, PY)
  - the training data T = {yi ∈ Y}i=1..|T|
  - the trellis (if fully connected):
    • size: |T| × |S|
    • each trellis state holds two [float] numbers (forward/backward)
  [Figure: trellis - a grid of |S| states over |T| positions]

The Algorithm Part I
1. Initialize PS, PY.
2. Compute "forward" probabilities:
  • follow the procedure for the trellis (summing), compute α(s,i) everywhere
  • use the current values of PS, PY (p(s'|s), p(y|s,s')):
    α(s',i) = Σ_{s→s'} α(s,i-1) × p(s'|s) × p(y_i|s,s')
  • NB: do not throw away the previous stage!
3. Compute "backward" probabilities:
  • start at all nodes of the last stage, proceed backwards, β(s,i)
  • i.e., the probability of the "tail" of the data from stage i to the end:
    β(s',i) = Σ_{s'←s} β(s,i+1) × p(s|s') × p(y_{i+1}|s',s)
  • also, keep the β(s,i) at all trellis states
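The two passes in Steps 2 and 3 can be sketched as below for the arc-emission HMM used here. This is a minimal illustration under assumptions, not code from the slides: the parameter layout (trans[s][s2] = p(s2|s), emit[(s, s2)][y] = p(y|s,s2)), the function names, and the start-state label "X" (the never-occurring front buffer state of the worked example below) are all introduced here for the sketch.

```python
def forward(states, trans, emit, data, start_state="X"):
    """alpha[i][s] = probability of the output prefix y_1..y_i, ending in state s."""
    alpha = [{s: (1.0 if s == start_state else 0.0) for s in states}]  # alpha(X,0) = 1
    for i, y in enumerate(data, start=1):
        stage = {}
        for s2 in states:                       # alpha(s',i): sum over predecessors s
            stage[s2] = sum(alpha[i - 1][s]
                            * trans.get(s, {}).get(s2, 0.0)        # p(s2|s)
                            * emit.get((s, s2), {}).get(y, 0.0)    # p(y_i|s,s2)
                            for s in states)
        alpha.append(stage)                     # keep every stage (per the NB above)
    return alpha

def backward(states, trans, emit, data, start_state="X"):
    """beta[i][s] = probability of the tail y_{i+1}..y_T given state s at stage i."""
    T = len(data)
    last = {s: (0.0 if s == start_state else 1.0) for s in states}     # beta(s,T) = 1, s != X
    beta = [{s: 0.0 for s in states} for _ in range(T)] + [last]
    for i in range(T - 1, -1, -1):
        y = data[i]                             # y_{i+1} in the slides' 1-based numbering
        for s in states:                        # beta(s,i): sum over successors s2
            beta[i][s] = sum(trans.get(s, {}).get(s2, 0.0)         # p(s2|s)
                             * emit.get((s, s2), {}).get(y, 0.0)   # p(y_{i+1}|s,s2)
                             * beta[i + 1][s2]
                             for s2 in states)
    return beta
```

Keeping every α and β stage is what makes the count collection in Part II below a single pass over the data.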
The Algorithm Part II
4. Collect counts:
  • for each output/transition pair compute
    c(y,s,s') = Σ_{i=0..k-1, y=y_{i+1}} α(s,i) × p(s'|s) × p(y_{i+1}|s,s') × β(s',i+1)
    (α(s,i) ~ prefix prob., p(s'|s) ~ transition prob., p(y_{i+1}|s,s') ~ output prob., β(s',i+1) ~ tail prob.)
    - one pass through the data, only stopping at (outputs of) y
  • c(s,s') = Σ_{y∈Y} c(y,s,s') (assuming all observed y_i are in Y)
  • c(s) = Σ_{s'∈S} c(s,s')
5. Reestimate:
  • p'(s'|s) = c(s,s') / c(s)
  • p'(y|s,s') = c(y,s,s') / c(s,s')
6. Repeat 2-5 until the desired convergence limit is reached.

Baum-Welch: Tips & Tricks
• Normalization badly needed
  - long training data → extremely small probabilities
• Normalize α, β using the same normalization factor N(i) = Σ_{s∈S} α(s,i), as follows:
  • compute α(s,i) as usual (Step 2 of the algorithm), computing the sum N(i) at the given stage i as you go
  • at the end of each stage, recompute all α's (for each state s): α*(s,i) = α(s,i) / N(i)
  • use the same N(i) for the β's at the end of each backward (Step 3) stage: β*(s,i) = β(s,i) / N(i)

Example
• Task: pronunciation of "the"
• Solution: build an HMM, fully connected, 4 states:
  * S - short article, L - long article, C, V - word starting with a consonant, vowel
  * thus, only "the" is ambiguous (a, an, the - not members of C, V)
• Output from states only (p(w|s,s') = p(w|s'))
• Data Y: an egg and a piece of the big .... the end
  [Figure: trellis over the data]

Example: Initialization
• Output probabilities:
  - p_init(w|c) = c(c,w) / c(c); where c(S,the) = c(L,the) = c(the)/2
    (other than that, everything is deterministic)
• Transition probabilities:
  - p_init(c'|c) = 1/4 (uniform)
• Don't forget:
  - about the space needed
  - initialize α(X,0) = 1 (X: the never-occurring front buffer state)
  - initialize β(s,T) = 1 for all s (except for s = X)

Fill in alpha, beta
• Left to right, alpha:
  α(s',i) = Σ_{s→s'} α(s,i-1) × p(s'|s) × p(w_i|s')   (output from states)
• Remember the normalization (N(i)).
• Similarly, beta (on the way back from the end).
  [Figure: trellis for "an egg and a piece of the big ... the end"]

Introduction to Natural Language Processing (600.465)
HMM Tagging
Dr. Jan Hajic, CS Dept., Johns Hopkins Univ.

The Task
• tagging ~ morphological disambiguation
• tagset VT ⊂ (C1, C2, ..., Cn)
  - Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
• mapping w → {t ∈ VT} exists
  - restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet, L is the set of lemmas
  - extension to punctuation, sentence boundaries (treated as words)

The Setting
• Noisy Channel setting:
  Input (tags): NNP VBZ DT ...  →  the channel (adds "noise")  →  Output (words): John drinks the ...
• Goal (as usual): discover the "input" to the channel (T, the tag sequence) given the "output" (W, the word sequence)
  - p(T|W) = p(W|T) p(T) / p(W)
  - p(W) fixed (W given) ... argmax_T p(T|W) = argmax_T p(W|T) p(T)

The Model
• Two models (d = |W| = |T|, the word sequence length):
  - p(W|T) = Π_{i=1..d} p(w_i | w_1,...,w_{i-1}, t_1,...,t_d)
  - p(T) = Π_{i=1..d} p(t_i | t_1,...,t_{i-1})
• Too many parameters (as always)
• Approximation using the following assumptions:
  - words do not depend on the context
  - the tag depends on a limited history: p(t_i|t_1,...,t_{i-1}) ≅ p(t_i|t_{i-n+1},...,t_{i-1})
    → the n-gram tag "language" model
  - the word depends on its tag only: p(w_i|w_1,...,w_{i-1},t_1,...,t_d) ≅ p(w_i|t_i)
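As a concrete reading of the two independence assumptions, the sketch below scores one tagged sentence under a trigram tag model (n = 3) and the word-given-tag approximation. The table names, the "<s>" padding symbol, and the use of log probabilities are illustrative assumptions, not part of the slides.

```python
import math

def tagged_logprob(words, tags, p_word_given_tag, p_tag_given_hist):
    """log[ p(W|T) p(T) ] under the approximations above, assuming:
    p_word_given_tag[(w, t)]      ~ p(w_i | t_i)
    p_tag_given_hist[(t2, t1, t)] ~ p(t_i | t_{i-2}, t_{i-1})  (trigram tag model)"""
    assert len(words) == len(tags)              # d = |W| = |T|
    hist = ("<s>", "<s>")                       # assumed start-of-sentence padding
    logp = 0.0
    for w, t in zip(words, tags):
        p_t = p_tag_given_hist.get(hist + (t,), 0.0)   # tag given limited history
        p_w = p_word_given_tag.get((w, t), 0.0)        # word given its tag only
        if p_t == 0.0 or p_w == 0.0:
            return float("-inf")                # unseen event (before smoothing)
        logp += math.log(p_t) + math.log(p_w)
        hist = (hist[1], t)                     # slide the (n-1)-tag history window
    return logp
```

Maximizing this score over all tag sequences T is the argmax_T p(W|T) p(T) search that the Viterbi decoder performs later ("Running the Tagger").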
The HMM Model Definition
• (Almost) the general HMM:
  - output (words) emitted by states (not arcs)
  - states: (n-1)-tuples of tags if an n-gram tag model is used
  - five-tuple (S, s0, Y, PS, PY), where:
    * S = {s0, s1, s2, ..., sT} is the set of states, s0 is the initial state,
    * Y = {y1, y2, ..., yV} is the output alphabet (the words),
    * PS(sj|si) is the set of probability distributions of transitions:
      PS(sj|si) = p(t_i|t_{i-n+1},...,t_{i-1}); s_j = (t_{i-n+2},...,t_i), s_i = (t_{i-n+1},...,t_{i-1}),
    * PY(yk|si) is the set of output (emission) probability distributions
  - another simplification: PY(yk|si) = PY(yk|sj) if s_i and s_j contain the same tag as the rightmost element:
    PY(yk|si) = p(w_i|t_i)

Supervised Learning (Manually Annotated Data Available)
• Use MLE
  - p(w_i|t_i) = c_wt(t_i,w_i) / c_t(t_i)
  - p(t_i|t_{i-n+1},...,t_{i-1}) = c_tn(t_{i-n+1},...,t_{i-1},t_i) / c_t(n-1)(t_{i-n+1},...,t_{i-1})
• Smooth (both!)
  - p(w_i|t_i): "Add 1" for all possible tag,word pairs using a predefined dictionary (thus some 0s are kept!)
  - p(t_i|t_{i-n+1},...,t_{i-1}): linear interpolation
    • e.g. for the trigram model:
      p'_λ(t_i|t_{i-2},t_{i-1}) = λ3 p(t_i|t_{i-2},t_{i-1}) + λ2 p(t_i|t_{i-1}) + λ1 p(t_i) + λ0 / |VT|

Unsupervised Learning
• Completely unsupervised learning is impossible
  - at least if we have the tagset given: how would we associate words with tags?
• Assumed (minimal) setting:
  - tagset known
  - dictionary/morphological analysis available (providing the possible tags for any word)
• Use: the Baum-Welch algorithm (see class 15, 10/13)
  - "tying": output (state-emitting only, same distribution from two states with the same "final" tag)

Comments on Unsupervised Learning
• Initialization of Baum-Welch
  - if some annotated data are available, use them
  - keep 0 for impossible output probabilities
• Beware of:
  - degradation of accuracy (the Baum-Welch criterion is entropy, not accuracy!)
  - use heldout data for cross-checking
• Supervised is almost always better

Unknown Words
• "OOV" words (out-of-vocabulary)
  - we do not have a list of possible tags for them
  - and we certainly have no output probabilities
• Solutions:
  - try all tags (uniform distribution)
  - try open-class tags (uniform, unigram distribution)
  - try to "guess" the possible tags (based on the suffix/ending)
    - use different output distributions based on the ending (and/or other factors, such as capitalization)

Running the Tagger
• Use Viterbi
  - remember to handle unknown words
  - single-best, n-best possible
• Another option:
  - always assign the best tag at each word, but consider all possibilities for the previous tags (no back pointers nor a path-backpass)
  - introduces random errors, implausible sequences, but might get higher accuracy (fewer secondary errors)

(Tagger) Evaluation
• A must. Test data (S), previously unseen (in training)
  - change the test data often if at all possible! ("feedback cheating")
  - error-rate based
• Formally:
  - Out(w) = set of output "items" for an input "item" w
  - True(w) = the single correct output (annotation) for w
  - Errors(S) = Σ_{i=1..|S|} δ(Out(w_i) ≠ True(w_i))
  - Correct(S) = Σ_{i=1..|S|} δ(True(w_i) ∈ Out(w_i))
  - Generated(S) = Σ_{i=1..|S|} |Out(w_i)|

Evaluation Metrics
• Accuracy: single output (tagging: each word gets a single tag)
  - Error rate: Err(S) = Errors(S) / |S|
  - Accuracy: Acc(S) = 1 - (Errors(S) / |S|) = 1 - Err(S)
• What if there are multiple (or no) outputs?
  - Recall: R(S) = Correct(S) / |S|
  - Precision: P(S) = Correct(S) / Generated(S)
  - Combination: the F measure: F = 1 / (α/P + (1-α)/R)
    • α is the weight given to precision vs. recall; for α = 0.5, F = 2PR / (R+P)
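A small sketch of the metrics just defined, assuming `out` maps each test item w_i to the set of output tags generated for it and `true` maps it to the single correct tag; the names and the dictionary layout are illustrative, not from the slides.

```python
def evaluate(out, true, alpha=0.5):
    """Precision, recall and F measure over a test set S (the keys of `true`)."""
    n = len(true)                                             # |S|
    correct = sum(1 for w, t in true.items() if t in out[w])  # Correct(S)
    generated = sum(len(ts) for ts in out.values())           # Generated(S)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / generated                           # P(S)
    recall = correct / n                                      # R(S)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)      # F = 1/(a/P + (1-a)/R)
    return precision, recall, f
```

For the single-output case (exactly one tag per word), Correct(S) + Errors(S) = |S|, so accuracy coincides with recall.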
Introduction to Natural Language Processing (600.465)
Transformation-Based Tagging

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

The Task, Again
• Recall:
  - tagging ~ morphological disambiguation
  - tagset VT ⊂ (C1, C2, ..., Cn)
    • Ci - morphological categories, such as POS, NUMBER, CASE, PERSON, TENSE, GENDER, ...
  - mapping w → {t ∈ VT} exists
    • restriction of Morphological Analysis: A+ → 2^(L,C1,C2,...,Cn), where A is the language alphabet, L is the set of lemmas
    • extension to punctuation, sentence boundaries (treated as words)

Setting
• Not a source-channel view
• Not even a probabilistic model (no "numbers" used when tagging a text after a model is developed)
• Statistical, yes:
  - uses training data (a combination of supervised [manually annotated data available] and unsupervised [plain text, large volume] training)
  - learning [rules]
  - criterion: accuracy (that's what we are interested in in the end, after all!)

The General Scheme
  [Figure: Training - the annotated training data feed the Learner; a copy with tags removed ("without annotation") gets initial tags assigned. Tagging - new data get the same initial tag assignment, then the learned rules are applied.]

The I/O of an Iteration
• In (iteration i):
  - intermediate data (initial, or the result of the previous iteration)
  - the TRUTH (the annotated training data)
  - [the pool of possible rules]
• Out:
  - one rule r_selected(i) to enhance the set of rules learned so far
  - intermediate data (the input data transformed by the rule learned in this iteration, r_selected(i))

The Initial Assignment of Tags
• One possibility:
  - NN
• Another:
  - the most frequent tag for a given word form
• Even:
  - use an HMM tagger for the initial assignment
• Not particularly sensitive

The Criterion
• Error rate (or Accuracy):
  - beginning of an iteration: some error rate E_in
  - each possible rule r, when applied at every data position:
    * makes an improvement somewhere in the data (c_improved(r))
    * makes it worse at some places (c_worsened(r))
    * and, of course, does not touch the remaining data
• Rule contribution to the improvement of the error rate:
  - contrib(r) = c_improved(r) - c_worsened(r)
• Rule selection at iteration i:
  - r_selected(i) = argmax_r contrib(r)
• New error rate: E_out = E_in - contrib(r_selected(i))

The Stopping Criterion
• Obvious:
  - no improvement can be made: contrib(r) ≤ 0
  - or the improvement is too small: contrib(r) ≤ Threshold
• NB: prone to overtraining!
  - therefore, setting a reasonable threshold is advisable
• Heldout?
  - maybe: remove rules which degrade performance on H

The Pool of Rules (Templates)
• Format: change the tag at position i from a to b / condition
• Context rules (condition definition - "template"): a window around position i,
    words: w_{i-3}  w_{i-2}  w_{i-1}  [w_i]  w_{i+1}  w_{i+2}  w_{i+3}
    tags:           t_{i-2}  t_{i-1}  [t_i]  t_{i+1}  t_{i+2}
• Instantiation: concrete w, t values permitted

Lexical Rules
• Other type: lexical rules - "look inside the word"
  - conditions on the form of w_i itself (prefixes, suffixes, letters) and its neighbors
• Examples:
  - w_i has the suffix -ied
  - w_i has the prefix ge-

Rule Application
• Two possibilities:
  - immediate consequences (left-to-right):
    • data: DT NN VBP NN VBP NN ...
    • rule: NN → NNS / preceded by NN VBP
    • apply the rule at position 4: DT NN VBP NN VBP NN ... → DT NN VBP NNS VBP NN ...
    • ... then the rule cannot apply at position 6 (the context is not NN VBP anymore).
  - delayed ("fixed input"):
    • use the original input for the context
    • the above rule then applies twice.
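The difference between the two application modes can be seen on the slide's own example (rule: NN → NNS if preceded by NN VBP). The function below is an illustrative sketch; the rule representation (a source/target tag plus a two-tag left context) is an assumption made for this example only.

```python
def apply_rule(tags, from_tag, to_tag, left_context, mode="immediate"):
    """Apply 'from_tag -> to_tag / preceded by left_context' to a tag sequence."""
    out = tags if mode == "immediate" else list(tags)
    context_source = out if mode == "immediate" else tags   # delayed = "fixed input"
    for i in range(2, len(out)):
        if out[i] == from_tag and \
           (context_source[i - 2], context_source[i - 1]) == left_context:
            out[i] = to_tag
    return out

data = ["DT", "NN", "VBP", "NN", "VBP", "NN"]
print(apply_rule(list(data), "NN", "NNS", ("NN", "VBP"), mode="immediate"))
# ['DT', 'NN', 'VBP', 'NNS', 'VBP', 'NN']   - the change at position 4 blocks position 6
print(apply_rule(list(data), "NN", "NNS", ("NN", "VBP"), mode="delayed"))
# ['DT', 'NN', 'VBP', 'NNS', 'VBP', 'NNS']  - the original context is used, the rule fires twice
```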
In Other Words
1. Strip the tags off the truth; keep the original truth.
2. Initialize the stripped data by some simple method.
3. Start with an empty set of selected rules, S.
4. Repeat until the stopping criterion applies:
  - compute the contribution of each rule r: contrib(r) = c_improved(r) - c_worsened(r)
  - select the r which has the biggest contribution contrib(r); add it to the final set of selected rules S.
5. Output the set S.

The Tagger
• Input:
  - untagged data
  - the rules (S) learned by the learner
• Tagging:
  - use the same initialization as the learner did
  - for i = 1..n (n - the number of rules learnt):
    • apply rule i to the whole intermediate data, changing (some) tags
  - the last intermediate data is the output.

N-best & Unsupervised Modifications
• N-best modification
  - allow rules to add tags
  - criterion: an optimal combination of accuracy and the number of tags per word (we want it close to 1!)
• Unsupervised modification
  - use only unambiguous words for the evaluation criterion
  - works extremely well for English
  - does not work for languages with few unambiguous words
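Putting the learner (steps 1-5 above) and the tagger side by side, a schematic version could look like the sketch below. Every helper passed in (initial_tags, candidate_rules, and an apply_rule that takes a whole rule object) is a hypothetical placeholder, not something defined in the slides.

```python
def tbl_learn(words, truth, initial_tags, candidate_rules, apply_rule, threshold=1):
    """Learn an ordered list of rules S from annotated data (words, truth)."""
    current = initial_tags(words)                    # step 2: initialize stripped data
    selected = []                                    # step 3: empty rule set S
    while True:                                      # step 4
        best_rule, best_contrib = None, 0
        for r in candidate_rules(words, current, truth):
            proposed = apply_rule(words, current, r)
            improved = sum(1 for a, b, t in zip(current, proposed, truth)
                           if a != t and b == t)     # c_improved(r)
            worsened = sum(1 for a, b, t in zip(current, proposed, truth)
                           if a == t and b != t)     # c_worsened(r)
            if improved - worsened > best_contrib:
                best_rule, best_contrib = r, improved - worsened
        if best_rule is None or best_contrib < threshold:
            break                                    # stopping criterion
        selected.append(best_rule)
        current = apply_rule(words, current, best_rule)
    return selected                                  # step 5: output S

def tbl_tag(words, rules, initial_tags, apply_rule):
    """Tag untagged data with the learned rules, in the order they were learnt."""
    tags = initial_tags(words)                       # same initialization as the learner
    for r in rules:
        tags = apply_rule(words, tags, r)            # apply rule i to all the data
    return tags
```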