MUNI FI
Markov Models
PA154 Language Modeling (5.1)
Pavel Rychlý
pary@fi.muni.cz
March 19, 2024
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Review: Markov Process
■ Bayes formula (chain rule):
  P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
■ n-gram language models:
  ■ Markov process (chain) of order n-1:
    P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
  ■ using just one distribution (Ex.: trigram model: p(w_i | w_{i-2}, w_{i-1})):
    Positions: 1  2   3     4    5 6   7      8     9   10 11  12    13   14 15  16
    Words:     My car broke down , and within hours Bob 's car broke down ,  too .
    p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)

Markov Properties
■ Generalize to any process (not just words/LM):
  ■ sequence of random variables: X = (X_1, X_2, ..., X_T)
  ■ sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
1. Limited History (Context, Horizon):
   ∀ i ∈ 1..T: P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_{i-1})
2. Time invariance (M.C. is stationary, homogeneous):
   ∀ i ∈ 1..T, ∀ x, y ∈ S: P(X_i = y | X_{i-1} = x) = p(y | x)
   (ok ... same distribution at every position)

Long History Possible
■ What if we want trigrams?
■ Formally, use a transformation: define new variables Q_i such that X_i = (Q_{i-1}, Q_i); then
  P(X_i | X_{i-1}) = P(Q_{i-1}, Q_i | Q_{i-2}, Q_{i-1})
■ Predicting X_i from the history X_{i-1} = (Q_{i-2}, Q_{i-1})
  [figure: example digit sequences illustrating the pair-of-states encoding]

Graph Representation: State Diagram
■ S = {s_0, s_1, s_2, ..., s_n}: states
■ Distribution P(X_i | X_{i-1}):
  ■ transitions (as arcs) with probabilities attached to them
■ Bigram case: sum of outgoing probabilities = 1
  p(toe) = .6 × .88 × 1 = .528 (see the code sketch below)

The Trigram Case
■ S = {s_0, s_1, s_2, ..., s_n}: states are pairs s_i = (x, y)
■ Distribution P(X_i | X_{i-1}): (r.v. X generates pairs s_i)
  p(toe) = .6 × .88 × .07 = .037
  p(one) = ?

Finite State Automaton
■ States ~ symbols of the [input/output] alphabet
  ■ pairs (or more): last element of the n-tuple
■ Arcs ~ transitions (sequence of states)
■ [Classical FSA: alphabet symbols on arcs:
  ■ transformation: arcs ↔ nodes]
■ Possible thanks to the "limited history" Markov Property

Hidden Markov Models
■ So far: Visible Markov Models (VMM)
■ The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":
  p(toe) = .6 × .88 × 1 = .528

Added Flexibility...
■ So far, no change; but different states may generate the same output (why not?):
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

Output from Arcs...
■ Added flexibility: generate output from arcs, not states:
  p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624

... and Finally, Add Output Probabilities
■ Maximum flexibility: [unigram] distribution (sample space: output alphabet) at each output arc:
  p(toe) = .6 × .8 × .88 × .7 × 1 × .6
         + .4 × .5 × 1 × 1 × .88 × .2
         + .4 × .5 × 1 × 1 × .12 × 1
         ≅ .237
  [figure: state diagram with a per-arc output distribution p(t), p(o), p(e) on each arc]

Slightly Different View
■ Allow for multiple arcs from s_i → s_j, mark them by output symbols, and get rid of the output distributions:
  [figure: arcs labeled with symbol/probability pairs, e.g. o,.06 and e,.12]
  p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≅ .237
■ In the future, we will use the view more convenient for the problem at hand.
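The bigram ("visible Markov model") computation above is easy to reproduce in code. Below is a minimal Python sketch: it assumes only the three transition probabilities that appear in the p(toe) = .528 product (start → t = .6, t → o = .88, o → e = 1); the rest of the diagram is not reconstructed here, and the `<start>` state name is an illustrative choice, not part of the slides.

```python
# Probability of a symbol sequence under a first-order (bigram) Markov chain.
# Only the arcs needed for the p(toe) example are listed; the full diagram
# from the slides is not reproduced (hypothetical subset).
trans = {
    ("<start>", "t"): 0.6,
    ("t", "o"): 0.88,
    ("o", "e"): 1.0,
}

def sequence_prob(symbols, trans, start="<start>"):
    """P(w_1..w_T) = prod_i p(w_i | w_{i-1}): the chain rule with a limited (one-symbol) history."""
    prob, prev = 1.0, start
    for s in symbols:
        prob *= trans.get((prev, s), 0.0)  # missing arc => probability 0
        prev = s
    return prob

print(sequence_prob("toe", trans))  # 0.6 * 0.88 * 1.0 = 0.528
```

The same function works for any first-order chain; a symbol pair with no arc simply drives the product to zero.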
Formalization
■ HMM (the most general case):
  ■ a five-tuple (S, s_0, Y, P_S, P_Y), where:
    ■ S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
    ■ Y = {y_1, y_2, ..., y_V} is the output alphabet,
    ■ P_S(s_j | s_i) is the set of probability distributions of transitions,
      ■ size of P_S: |S|²,
    ■ P_Y(y_k | s_i, s_j) is the set of output (emission) probability distributions,
      ■ size of P_Y: |S|² × |Y|.

Formalization - Example
■ Example:
  ■ S = {x, 1, 2, 3, 4}, s_0 = x
  ■ Y = {t, o, e}

Using the HMM
■ The generation algorithm (of limited value :-)):
  1. Start in s = s_0.
  2. Move from s to s' with probability P_S(s' | s).
  3. Output (emit) symbol y_k with probability P_Y(y_k | s, s').
  4. Repeat from step 2 (until somebody says enough).
■ More interesting usage (sketched in code below):
  ■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
  ■ Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states which has generated it.
  ■ ... plus variations: e.g., n best state sequences.
■ Example (for the graph, see foils 11, 12):
  ■ S = {x, 1, 2, 3, 4}, s_0 = x
  ■ Y = {e, o, t}
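Both the generation algorithm and the "compute the probability of an output sequence" task map onto a few lines of code. The sketch below assumes an arc-emission HMM stored as two dictionaries, P_s[(s, s')] and P_y[(s, s', y)]; the tiny model at the bottom is a made-up illustration, not the HMM from foils 11-12, and the probability task is solved with the standard forward (dynamic-programming) summation, which the slides do not spell out here.

```python
import random

def generate(P_s, P_y, s0, length):
    """Generation algorithm: start in s0, move with P_s(s'|s), emit with P_y(y|s,s')."""
    s, output = s0, []
    for _ in range(length):
        nxt, tprobs = zip(*[(b, p) for (a, b), p in P_s.items() if a == s])
        s_next = random.choices(nxt, tprobs)[0]            # step 2: move to s'
        syms, eprobs = zip(*[(y, p) for (a, b, y), p in P_y.items()
                             if (a, b) == (s, s_next)])
        output.append(random.choices(syms, eprobs)[0])     # step 3: emit y_k
        s = s_next                                          # then repeat from step 2
    return output

def output_prob(P_s, P_y, s0, outputs):
    """P(Y): sum over all state sequences that could emit Y (forward algorithm)."""
    alpha = {s0: 1.0}   # probability of reaching each state after emitting the prefix
    for y in outputs:
        new_alpha = {}
        for (a, b), pt in P_s.items():
            pe = P_y.get((a, b, y), 0.0)
            if a in alpha and pe > 0.0:
                new_alpha[b] = new_alpha.get(b, 0.0) + alpha[a] * pt * pe
        alpha = new_alpha
    return sum(alpha.values())

# A made-up two-state arc-emission HMM over Y = {t, o, e} (illustrative only):
P_s = {("x", "1"): 0.6, ("x", "2"): 0.4, ("1", "2"): 1.0, ("2", "1"): 1.0}
P_y = {("x", "1", "t"): 1.0, ("x", "2", "t"): 1.0,
       ("1", "2", "o"): 0.7, ("1", "2", "e"): 0.3,
       ("2", "1", "o"): 0.5, ("2", "1", "e"): 0.5}

print(output_prob(P_s, P_y, "x", list("toe")))  # 0.6*0.7*0.5 + 0.4*0.5*0.3 = 0.27
```

The most-likely-state-sequence task mentioned above is the same recursion with the sum replaced by a max (Viterbi); it is left out of this sketch.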