Markov Models
PA154 Language Modeling (5.1)
Pavel Rychlý
pary@fi.muni.cz
March 19, 2024
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Review: Markov Process
■ Bayes formula (chain rule):
  P(W) = p(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
■ n-gram language models:
  ■ Markov process (chain) of order n-1:
    P(W) = p(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
■ Using just one distribution (Ex.: trigram model: p(w_i | w_{i-2}, w_{i-1})):
  Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
  Words:     My car broke down , and within hours Bob 's car broke down ,  too .
  p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)

Markov Properties
■ Generalize to any process (not just words/LM):
  ■ Sequence of random variables: X = (X_1, X_2, ..., X_T)
  ■ Sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
1. Limited History (Context, Horizon):
   ∀i ∈ 1..T: P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_{i-1})
   (e.g., in the sequence 1 7 3 7 9 0 6 7 3 4 5..., the next symbol depends only on the last one)
2. Time invariance (M.C. is stationary, homogeneous):
   ∀i ∈ 1..T, ∀y, x ∈ S: P(X_i = y | X_{i-1} = x) = p(y | x)
   (1 7 3 7 9 0 6 7 3 4 5... ok: the same distribution applies at every position)

Long History Possible
■ What if we want trigrams (1 7 3 [7 9] 0 6 7 3 4 5...)? Let each state remember the last two symbols.
■ Formally, use a transformation: define new variables Q_i such that X_i = {Q_{i-1}, Q_i}; then
  P(X_i | X_{i-1}) = P(Q_{i-1}, Q_i | Q_{i-2}, Q_{i-1}) = P(Q_i | Q_{i-2}, Q_{i-1})
■ Predicting (X_i): 1 7 3 7 [9 0] 6 7 3 4 5...
  History (X_{i-1} = {Q_{i-2}, Q_{i-1}}): the preceding pair of symbols

Graph Representation: State Diagram
■ S = {s_0, s_1, s_2, ..., s_N}: states
■ Distribution P(X_i | X_{i-1}): transitions (as arcs) with probabilities attached to them
■ Bigram case (state diagram figure omitted):
  p(toe) = .6 × .88 × 1 = .528

The Trigram Case
■ S = {s_0, s_1, s_2, ..., s_N}: states are pairs s_i = (x, y)
■ Distribution P(X_i | X_{i-1}): (r.v. X generates pairs s_i)
  p(toe) = .6 × .88 × .07 ≅ .037
  p(one) = ?

Finite State Automaton
■ States ~ symbols of the [input/output] alphabet
  ■ pairs (or more): last element of the n-tuple
■ Arcs ~ transitions (sequence of states)
■ [Classical FSA: alphabet symbols on arcs; transformation: arcs ↔ nodes]
■ Possible thanks to the "limited history" Markov Property
■ So far: Visible Markov Models (VMM)

Hidden Markov Models
■ The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":
  p(toe) = .6 × .88 × 1 = .528

Added Flexibility...
■ So far, no change; but different states may generate the same output (why not?):
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

Output from Arcs...
■ Added flexibility: generate output from arcs, not states:
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624
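Before output distributions are added on the next foil, it may help to contrast with the fully visible case in code. The following is a minimal Python sketch (not part of the original slides) that walks the bigram state diagram of the earlier foil, where the state path is the output itself. The transition table P and the start-state name 'start' are assumptions chosen to reproduce p(toe) = .6 × .88 × 1 = .528.

P = {
    ('start', 't'): 0.6,   # assumed initial transition into state t
    ('t', 'o'): 0.88,
    ('o', 'e'): 1.0,
}

def chain_probability(seq, start='start'):
    # In a visible Markov chain the state sequence is observed,
    # so p(seq) is simply the product of transition probabilities.
    p, prev = 1.0, start
    for s in seq:
        p *= P.get((prev, s), 0.0)   # unlisted transitions get probability 0
        prev = s
    return p

print(chain_probability('toe'))   # ~ 0.528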
...and Finally, Add Output Probabilities
■ Maximum flexibility: [unigram] distribution (sample space: output alphabet) at each output arc (simplified!):
  (figure omitted: arcs labeled with output distributions such as p(t) = .8, p(o) = 0, p(e) = 1)
  p(toe) = .6 × .8 × .88 × .7 × 1 × .6
         + .4 × .5 × 1 × 1 × .88 × .2
         + .4 × .5 × 1 × 1 × .12 × 1
         ≅ .237

Slightly Different View
■ Allow for multiple arcs from s_i to s_j, mark them by the output symbol, and get rid of the output distributions:
  (figure omitted: multiple labeled arcs, e.g. o,.06 and e,.12)
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237
■ In the future, we will use whichever view is more convenient for the problem at hand.

Formalization
HMM (the most general case):
■ a five-tuple (S, s_0, Y, P_S, P_Y), where:
  ■ S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
  ■ Y = {y_1, y_2, ..., y_V} is the output alphabet,
  ■ P_S(s_j | s_i) is the set of probability distributions of transitions,
    ■ size of P_S: |S|^2,
  ■ P_Y(y_k | s_i, s_j) is the set of output (emission) probability distributions,
    ■ size of P_Y: |S|^2 × |Y|.
Example:
■ S = {x, 1, 2, 3, 4}, s_0 = x
■ Y = {t, o, e}

Formalization - Example
Example (for the graph, see foils 11, 12): S = {x, 1, 2, 3, 4}, s_0 = x; Y = {t, o, e}
■ P_S (rows: from-state; columns: to-state):

        x    1    2    3    4
  x     0   .6    0   .4    0
  1     0    0  .12    0  .88
  2     0    0    0    0    1
  3     0    1    0    0    0
  4     0    0    1    0    0

■ P_Y (emission tables figure omitted; the entries used in the p(toe) computation are
  P_Y(t | x,1) = .8, P_Y(t | x,3) = .5, P_Y(o | 3,1) = 1, P_Y(o | 1,4) = .7,
  P_Y(e | 1,2) = 1, P_Y(e | 1,4) = .2, P_Y(e | 4,2) = .6)

Using the HMM
The generation algorithm (of limited value :-)):
1. Start in s = s_0.
2. Move from s to s' with probability P_S(s' | s).
3. Output (emit) symbol y_k with probability P_Y(y_k | s, s').
4. Repeat from step 2 (until somebody says enough).
More interesting usage (see the sketch below):
■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states that generated it.
■ ...plus variations: e.g., n best state sequences.
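As an illustration of the first "more interesting usage", here is a minimal Python sketch (not part of the original slides) that computes the probability of an output sequence by brute force, summing over all state paths of the example HMM from foils 11-14. The emission dictionary P_Y below holds only the entries recoverable from the foils; every other emission probability is assumed to be 0.

from itertools import product

# Transition probabilities P_S(s' | s) from the example (foil 14).
P_S = {
    'x': {'1': 0.6, '3': 0.4},
    '1': {'2': 0.12, '4': 0.88},
    '2': {'4': 1.0},
    '3': {'1': 1.0},
    '4': {'2': 1.0},
}

# Emission probabilities P_Y(y | s, s'): only the entries appearing in
# the p(toe) computation are recoverable; all others assumed to be 0.
P_Y = {
    ('x', '1'): {'t': 0.8},
    ('x', '3'): {'t': 0.5},
    ('3', '1'): {'o': 1.0},
    ('1', '4'): {'o': 0.7, 'e': 0.2},
    ('1', '2'): {'e': 1.0},
    ('4', '2'): {'e': 0.6},
}

def sequence_probability(output, s0='x'):
    # Brute force: sum P(path) * P(output | path) over every state path
    # of the right length. Exponential in len(output); the standard
    # trellis (dynamic programming) view computes the same sum efficiently.
    states = list(P_S)
    total = 0.0
    for path in product(states, repeat=len(output)):
        p, prev = 1.0, s0
        for s, y in zip(path, output):
            p *= P_S.get(prev, {}).get(s, 0.0)
            p *= P_Y.get((prev, s), {}).get(y, 0.0)
            prev = s
        total += p
    return total

print(round(sequence_probability('toe'), 3))   # 0.237, matching the foils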