PV030 Textual Information Systems
Petr Sojka, Faculty of Informatics, Masaryk University, Brno. Spring 2012.

Part I: Information about the course PV030 Textual Information Systems.

Basic info:
• Petr Sojka, sojka@fi.muni.cz
• Consulting hours Spring 2012: Wednesday 13:00–13:45, Friday 10:00–11:50, or write an email with other suggestions to meet.
• Room C523/522, fifth floor of block C, Botanická 68a.
• Course homepage: http://www.fi.muni.cz/~sojka/PV030/
• Seminar (Thu 12:00–12:50, C511 → B311).

Prerequisites: the student is expected to have basic knowledge of the theory of automata and formal languages (IB005) and elementary knowledge of the theories of complexity, software programming and systems.

Classification: there is a system of classification based on written mid-term (30 points) and final (70 points) exams. In addition, one can get premium points for activities during the lectures. Classification scale (changes reserved): Z/K/E/D/C/B/A correspond to obtaining 48/54/60/66/72/78/84 points. Dates of [final] exams will be announced via IS.muni.cz (probably three terms).

"My books focus on timeless truth." (D. E. Knuth, Brno, 1996)

An emphasis will be given to the explanation of basic principles, algorithms and (software) design techniques, and to the creation and implementation of textual information systems (TIS) — storage and information retrieval.

Course syllabus:
1. Basic notions. TIS (text information system). Classification of information systems. From texts to Watson.
2. Searching in TIS. Searching and pattern matching: classification and data structures. Algorithms of Knuth-Morris-Pratt, Aho-Corasick, regular expressions.
3. Algorithms of Boyer-Moore, Commentz-Walter, Buczilowski.
4. Theory of automata for searching. Classification of searching problems. Searching with errors.
5. Indexes. Indexing methods. Data structures for searching and indexing.
6. Google as an example of a search and indexing engine. PageRank. Signature methods.
7. Query languages and document models: Boolean, vector, probabilistic, MMM, Paice.
8. Data compression. Basic notions. Entropy.
9. Statistical methods.
10. Compression methods based on dictionary.
11. Syntactic methods. Context modeling. Language modeling. Corpus linguistics.
12. Spell checking. Filtering information channels. Document classification. Neural nets for text compression.

Textbooks:
[MEL] Melichar, B.: Textové informační systémy, skripta ČVUT Praha, 2. vydání, 1996.
[POK] Pokorný, J., Snášel, V., Húsek, D.: Dokumentografické informační systémy, Karolinum Praha, 1998.
[KOR] Korfhage, R. R.: Information Storage and Retrieval, Wiley Computer Publishing, 1997.
[SMY] Smyth, B.: Computing Patterns in Strings, Addison Wesley, 2003.
[KNU] Knuth, D. E.: The Art of Computer Programming, Vol. 3, Sorting and Searching, Second edition, 1998.
[WMB] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images, Second edition, Morgan Kaufmann Publishers, 1999.

Further study materials (cont.):
[HEL] Held, G.: Data and Image Compression, Tools and Techniques, John Wiley & Sons, 4. vydání, 1996.
[MEH] Melichar, B., Holub, J.: A 6D Classification of Pattern Matching Problems, Proceedings of the Prague Stringology Club Workshop '97, Prague, July 1997.
[GOO] Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7/Computer Networks 30(1–7): 107–117 (1998). http://dbpubs.stanford.edu:8090/pub/1998-8
[MeM] Mohri, M.: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61–80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
[Sch] Schmidhuber, J.: Sequential neural text compression, IEEE Transactions on Neural Networks 7(1), 142–146, 1996. http://www.idsia.ch/~juergen/onlinepub.html
[SBA] Salton, G., Buckley, Ch., Allan, J.: Automatic structuring of text files, Electronic Publishing.

Basic notions:
• Abstractions and mappings in information systems: data ↔ information need ↔ query.
• Information needs about the reality — queries above the data.
• Jeopardy game: Watson.

Definition: An information system is a system that allows purposeful arrangement of collection, storage, processing and delivering of information.

Definition: The ectosystem consists of the IS users, the investor of the IS, and the entrepreneur (user, funder, server). In the example of is.muni.cz they are the users of the IS, MU represented by the bursar, and the ICS and IS teams. The ectosystem is not under the control of the IS designer.

Definition: The endosystem consists of the hardware used (media, devices) and the software (algorithms, data structures), and is under the control of the IS designer.

Quality measures of an IS:
• effectiveness (user),
• economics (funder),
• efficiency (server),
and the compromises implied by these different preferences. Our view will be the view of a TIS architect respecting the requests of the IS ectosystem. For topics related to the ectosystem of IS see PV045 Management IS.

• Data: a concrete representation of a message in the form of a sequence of symbols of an alphabet.
• Information: a reflection of the known or the expected substance of realities. A piece of information depends on the intended subject. Viewpoints:
  • quantitative (information theory);
  • qualitative (meaning, semantics);
  • pragmatic (valuation: significance, usefulness, usability, periodicity, up-to-dateness, credibility);
  • others (promptness, particularity, completeness, univocality, availability, costs of obtaining).
• Knowledge (znalost).
• Wisdom (moudrost).

Notions of (T)IS; PV030 in the context of teaching at FI MU.

Definition: An information process is a process of formation of information, its representation in the form of data, its processing, providing, and use. Operations with information correspond to this process. Data/signals → Information → Knowledge → Wisdom.

(T)IS classification by the prevailing function:
1. Information retrieval systems.
2. Database management systems (DBMS), relational DB (PB154, PB155, PV003, PV055, PV136, PB114).
3. Management information systems (PV045).
4. Decision support systems (PV098).
5. Expert systems, question answering systems, knowledge-based systems (PA031).
6. Information service systems (web 2.0).
(T)IS classification by the prevailing function (cont.):
7. Specific information systems (geographical PV019, PA049, PA050; medical PV048; environmental PV044; corporate PV043; state administration PV058, PV059; librarian PV070); and also PV063 Application of Database Systems.

Related fields taught at FI: Software engineering (PA102, PA105). Similarity searching in multimedia data (PA128). Efficient use of database systems (PA152). Introduction to information retrieval (PV211).

A hierarchy of the notions: information retrieval system — expert system — DBMS — database system — management system.

Mini questionnaire:
1. What do you expect from this course? What was your motivation to enroll? Is the planned syllabus fine? Any changes or surprises?
2. What do you not expect (what would you rather eliminate)?
3. Which related courses have you already passed?
4. Practising IS usage (as a user): a) Which (T)IS do you use? b) Intensity? Frequency? How many searches per month? c) Are you satisfied with it?
5. IS creation (server): a) Which (T)IS or its components have you realized? Area, size? b) Are you satisfied with it? Bottlenecks?

Information retrieval systems (IRS) — principles (scheme): the definition of documents, the definition of the output file format and the definition of the search method are fed to the SEARCH ENGINE; the content of the documents (DB) and the queries then yield the set of selected documents. Simplified: DB + query → search engine → set of selected documents.

Searching — formalization of the problem:
Concatenation: a string of beads. A bead is an element. Indexing of elements by natural numbers — not necessarily numbers, but labels:
0) Every element has a unique label.
1) Every labeled element x (except for the leftmost one) has a unique predecessor referred to as pred(x).
2) Every labeled element x (except for the rightmost one) has a unique successor referred to as succ(x).
3) If the element x is not the leftmost one, x = succ(pred(x)).
4) If the element x is not the rightmost one, x = pred(succ(x)).
5) For every two different elements x and y, there exists a positive number k such that either x = succ^k(y) or x = pred^k(y).
Searching — formalization of the problem (cont.):
Definition: a string is a set of elements which meets the rules 0)–5).
Definition: a linear string: a string that has finitely many elements, including the leftmost and the rightmost ones.
Definition: a necklace.
Definition: an alphabet A; letters of the alphabet; A+ (the nonempty words over A); the empty string ε.
Definition: A* = A+ ∪ {ε}.
Definition: a linear string over A: a member of A+.
Definition: a pattern. A text.

Classification of search algorithms:
1. According to the passing direction: left-to-right / right-to-left.
2. According to the (pre)processing of the text and the pattern:
   • ad fontes (searching in the text itself);
   • text surrogate (searching in a substitute of the text);
   • substitutes: an index — an ordered list of significant elements together with references to the original text; a signature — a string of indicators that shows the occurrence of significant elements in the text.

Classification (cont.):

                         pattern preprocessing
  text preprocessing       no      yes
  no                       I       II
  yes                      III     IV

I — elementary algorithms; II — creating a search engine; III — indexing methods; IV — signature methods.

Classification according to the cardinality of the set of patterns:
1. Search for a single pattern v in the text T. The result: yes/no.
2. Search for a finite set of patterns P = {v1, v2, ..., vk}. The result: information about the positions of some of the given patterns.
3. Search for an infinite set of patterns given by a regular expression R. R defines a potentially infinite set L(R). The result: information about the positions of some of the patterns from L(R).

Alternatives of the formulation of the searching problem: a) the first occurrence; b) all occurrences without overlapping; c) all occurrences including overlapping.

I. SE without preprocessing of the patterns or the text — exact search. Naive search, brute-force search, the rudimentary search algorithm:

proc Brute-Force-Matcher(PATTERN, TEXT):
  T := length[TEXT]; P := length[PATTERN];
  for i := 0 to T-P do
    if PATTERN[1..P] = TEXT[i+1..i+P] then
      print "The pattern was found at the position i"

Time complexity analysis of the naive search:
• The complexity is measured by the number of comparisons; P is the length of the pattern, T the length of the text.
• The upper estimate S = P · (T − P + 1), thus O(P × T).
• The worst case: PATTERN = a^(P−1)b, TEXT = a^(T−1)b.
• Natural languages: the (average) complexity (number of comparisons) is substantially smaller, since the equality of prefixes doesn't occur very often; for English S = C_E · (T − P + 1), C_E empirically measured as 1.07, i.e. practically linear.
• C_Cz? C_Cz vs. C_E?
• Any speedups? An application to several patterns? An infinite number?
• We will see the versions (S, Q, Q') of the algorithm in the seminar.
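The cost model above can be checked experimentally. A hedged sketch (hypothetical helper name, 0-based Python indexing, not from the course materials) of the brute-force matcher instrumented to count symbol comparisons:

# Brute-force matching with a comparison counter, so that the estimate
# S = P * (T - P + 1) can be verified on the worst case from the slide.
def brute_force_count(pattern: str, text: str):
    P, T = len(pattern), len(text)
    comparisons, positions = 0, []
    for i in range(T - P + 1):
        j = 0
        while j < P:
            comparisons += 1
            if pattern[j] != text[i + j]:
                break
            j += 1
        if j == P:
            positions.append(i)      # 0-based position of an occurrence
    return positions, comparisons

# Worst case: PATTERN = a^(P-1) b, TEXT = a^(T-1) b.
print(brute_force_count("aaab", "a" * 19 + "b"))
# -> ([16], 68), i.e. exactly P * (T - P + 1) = 4 * 17 comparisons

On English-like text the same counter stays close to 1.07 · (T − P + 1), matching the empirical constant mentioned above.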
Exercise: express the time complexity of the following search algorithms using the variables c and s, where c is the number of the tests and these statements hold:
• if the index i is found, then c = i and s = 1;
• otherwise, c = T and s = 0.

Naive search — algorithm:

input: var TEXT : array [1..T] of word; PATTERN : word;
output (in the variable FOUND): yes/no

1      I := 1;
c      while I <= T do begin
c        if TEXT[I] = PATTERN then break;
c−s      inc(I);
       end;
2      FOUND := (I <> T+1);

A variant with a sentinel:

input: var TEXT : array [1..T+1] of word; PATTERN : word;

1      I := 1;
1      TEXT[T+1] := PATTERN;
c      while TEXT[I] <> PATTERN do
c−1      inc(I);
2      FOUND := (I <= T);

In this case, the index is always found; therefore the last but one line states the complexity c − 1 instead of c − s (although they are equivalent here, since s = 1 always). Furthermore, it is necessary to realize that the maximal possible value of c is greater by one than in the previous algorithm (stating c + 1 instead of c would not be correct, though). The overall complexity: O(T) = 2c + 3. The maximum complexity: O(T) = 2T + 5.

And how about using cycle expansion (loop unrolling)?

input: var TEXT : array [1..T+1] of word; PATTERN : word;
output (in the variable FOUND): yes/no

1          I := 1;
1          TEXT[T+1] := PATTERN;
⌈c/2⌉      while TEXT[I] <> PATTERN do begin
⌊c/2⌋        if TEXT[I+1] = PATTERN then break;
⌊(c−1)/2⌋    I := I + 2;
           end;
3          FOUND := (I < T) or (TEXT[T] = PATTERN);

The overall complexity: O(T) = c + ⌊(c − 1)/2⌋ + 5. The maximum complexity: O(T) = T + ⌊T/2⌋ + 6. The condition at the end of the algorithm guarantees its functionality (however, it is not the only way of handling the cycle incrementation by two).

Outline:
• Watson.
• Exact search methods I (without pattern preprocessing) — completion.
• Exact search methods II (with pattern preprocessing, left to right): KMP (animation), Rabin-Karp, AC.
• Search with an automaton.

Evaluation of the questionnaire:
1. Yes: the syllabus suits the expectations; the dissection of Google is awaited positively; indexing and search; examples.
2. No: too much theory, deep digestion of algorithms.
3. Examples.
4. This year: further enrichment of the information retrieval part (Google), textual (mathematical) digital libraries, and language enhancements of TIS (on the example of Watson).

Examples of searching:
• Search in a text editor (Vim, Emacs), in the source code of a web page.
• Data search (biological molecules approximated as sequences of nucleotides or amino acids).
• Literature/abstract search — recherche, corpus linguistics.
The size of available data doubles every 18 months (Moore's law) → higher effectiveness of algorithms is needed.
II. Exact search with query (pattern) preprocessing — left-to-right direct search methods:

During the preprocessing, the structure of the query pattern(s) is examined and, on that basis, the search engine is built (on the fly).

Definition: exact (vs. fuzzy (proximitní)) search aims at an exact match (localization of the searched pattern(s)).
Definition: left-to-right (LR, sousměrné) (vs. right-to-left (RL, protisměrné)) search compares the query pattern to the text from left to right (vs. from right to left).

Left-to-right methods:
1. A single query pattern (vzorek):
   • Shift-Or algorithm.
   • Karp-Rabin algorithm (KR, 1987).
   • Knuth-Morris-Pratt algorithm (KMP, designed (MP) in 1970, published 1977).
2. n patterns: Aho-Corasick algorithm (AC, 1975).
3. ∞ patterns: construction of a search engine (finite automaton) for the search of a potentially infinite set of patterns (given as a regular expression).

Shift-Or algorithm:
• Pattern v = v1 v2 ... vm over an alphabet Σ = {a1, ..., ac}.
• Incidence matrix X (m × c), X_ij = 0 if v_i = a_j, X_ij = 1 otherwise.
• The column of X corresponding to a_j is named A_j.
• At the beginning, we put a unitary vector into the column vector R. In every step of the algorithm, R moves down by one line/position, the top-most position is filled with zero, and one character a_j is read from the input. The resulting R is combined with A_j by binary disjunction: R := SHIFT(R) OR A_j.
• The algorithm stops successfully when 0 appears at the bottom-most position of R.
Example: v = vzorek over Σ = {e, k, o, r, v, z}. Cf. [POK, page 31–32].

Karp-Rabin search:

A quite different approach: usage of a hash function. Instead of matching the pattern against the text at every position, we check the match only when the pattern 'looks similar' to the searched text substring. For the similarity, a hash function is used. It has to be
• efficiently computable,
• and it should be good at separating different strings (close to perfect hashing).
KR search is quadratic in the worst case, but on average O(T + V).
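A minimal sketch of this rolling-hash idea (hedged: Python's unbounded integers replace machine-word arithmetic; the base-2 hash and the rehash step mirror the C implementation that follows):

# Karp-Rabin: verify a candidate position only when the window hash
# matches the pattern hash; the window hash is updated in O(1).
def karp_rabin(pattern: str, text: str):
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    d = 1 << (m - 1)                  # weight of the leading symbol, 2^(m-1)
    hp = ht = 0
    for i in range(m):                # initial hashes of pattern and window
        hp = (hp << 1) + ord(pattern[i])
        ht = (ht << 1) + ord(text[i])
    hits = []
    for i in range(n - m + 1):
        if hp == ht and text[i:i + m] == pattern:    # verify on hash match
            hits.append(i)
        if i + m < n:                 # REHASH: drop text[i], append text[i+m]
            ht = ((ht - ord(text[i]) * d) << 1) + ord(text[i + m])
    return hits

print(karp_rabin("ana", "banana"))    # [1, 3]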
Karp-Rabin search (cont.) — implementation:

#define REHASH(a, b, h) (((h-a*d)<<1)+b)

void KR(char *y, char *x, int n, int m) {
  int hy, hx, d, i;
  /* preprocessing: computation of d = 2^(m-1) */
  d = 1; for (i = 1; i < m; i++) d = (d << 1);
  /* ... (the rest of the routine computes hx, hy and slides the window
     over the text using REHASH, as sketched above) */
}

Knuth-Morris-Pratt algorithm:

i := 1;                        -- text index
j := 1;                        -- pattern index
while (i <= T) and (j <= V) do
  while (j > 0) and (text[i] <> pattern[j]) do
    j := h[j];
  end while;
  i := i + 1; j := j + 1;
end while;
found := (j > V);              -- if found, it is at the position i − V

Knuth-Morris-Pratt algorithm (cont.):
• O(T) complexity plus the complexity of the preprocessing (creation of the array h).
• Animation of the tracing of the main part of KMP.
• h is used when the prefix of the pattern v1 v2 ... v(j−1) matches the substring of the text t(i−j+1) t(i−j+2) ... t(i−1) and v_j ≠ t_i.
• May I shift by more than 1? By j? How to compute h?
• h(j) is the biggest k < j such that v1 v2 ... v(k−1) is a suffix of v1 v2 ... v(j−1), i.e. v1 v2 ... v(k−1) = v(j−k+1) v(j−k+2) ... v(j−1), and v_j ≠ v_k.
• KMP: backward transitions are taken for as long as needed, until j = 0 (the prefix of the pattern is not contained in the searched text) or t_i = v_j.
• Animation by Lecroq; also [POK, page 27]; see [MAR] for a detailed description.

Computation of h for KMP:

i := 1; j := 0; h[1] := 0;
while (i < V) do begin
  while (j > 0) and (v[i] <> v[j]) do j := h[j];
  i := i + 1; j := j + 1;
  if (i <= V) and (v[i] = v[j]) then h[i] := h[j] else h[i] := j   (*MP*)
end;

The complexity of the computation of h, i.e. of the preprocessing, is O(V); thus O(T + V) in total. Example: h for ababa. KMP vs. MP.

Search with a search engine (finite automaton) that uses a transition table g derived from the searched pattern (g relates to the transition function δ of a FA):

var i, T: integer; found: boolean;
    text: array[1..T] of char;
    state, q0: TSTATE;
    g: array[1..maxstate, 1..maxsymb] of TSTATE;
    F: set of TSTATE; ...
begin
  found := FALSE; state := q0; i := 0;
  while (i <= T) and not found do begin
    i := i + 1;
    state := g[state, text[i]];
    found := state in F;
  end;
end;

How to transform the pattern into g?
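The two Pascal fragments above (the computation of h and the search loop) combine into the following self-contained sketch. It uses the common 0-based border-array formulation rather than the slides' 1-based convention, so for ababa it prints the border lengths [-1, 0, 0, 1, 2, 3], which correspond to the lecture's 1-based h values (0, 1, 1, 2, 3):

def mp_failure(pattern):
    # h[j] = length of the longest proper border of pattern[:j]
    h = [0] * (len(pattern) + 1)
    h[0] = -1
    k = -1
    for j in range(1, len(pattern) + 1):
        while k >= 0 and pattern[k] != pattern[j - 1]:
            k = h[k]
        k += 1
        h[j] = k
    return h

def kmp_search(pattern, text):
    h = mp_failure(pattern)
    hits, j = [], 0
    for i, c in enumerate(text):
        while j >= 0 and pattern[j] != c:   # backward transitions
            j = h[j]
        j += 1
        if j == len(pattern):               # occurrence ends at position i
            hits.append(i - j + 1)
            j = h[j]
    return hits

print(mp_failure("ababa"))                  # [-1, 0, 0, 1, 2, 3]
print(kmp_search("ababa", "abababa"))       # [0, 2]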
Search engine (SE) for left-to-right search: A = (Q, T, g, h, q0, F), where
• Q is a finite set of states;
• T is a finite input alphabet;
• g: Q × T → Q ∪ {fail} is a forward state-transition function;
• h: (Q − {q0}) → Q is a backward state-transition function;
• q0 is an initial state;
• F is a set of final states.

• The depth of a state q: d(q) ∈ N0 is the length of the shortest sequence of forward state transitions from q0 to q.
• Characteristics of g, h:
  • g(q0, a) ≠ fail for ∀a ∈ T (there is no backward transition in the initial state).
  • If h(q) = p, then d(p) < d(q) (the number of the backward transitions is thus bounded from above by the number of the forward transitions plus the maximum depth of a state). So the speed of the searching is linear in the length of the text.

SE configuration, transitions:
• An SE configuration is (q, w), q ∈ Q, w ∈ T* the not yet searched part of the text.
• The initial configuration of the SE is (q0, w), where w is the entire searched text.
• An accepting configuration of the SE is (q, w), q ∈ F; w is the not yet searched text, the found pattern is immediately before w.
• SE transition: a relation ⊢ ⊆ (Q × T*) × (Q × T*):
  • if g(q, a) = p, then (q, aw) ⊢ (p, w) — a forward transition, for ∀w ∈ T*;
  • if h(q) = p, then (q, w) ⊢ (p, w) — a backward transition, for ∀w ∈ T*.
During a forward transition, a single input symbol is read and the engine switches to the next state p. However, if g(q, a) = fail, the backward transition is executed without reading an input symbol. S = O(T) (we measure the number of SE transitions).

Construction of the KMP engine:
1. An initial state q0.
2. g(q, v(j+1)) = q', where q' corresponds to the prefix v1 v2 ... v(j+1).
3. For q0, we define g(q0, a) = q0 for ∀a for which g(q0, a) has not been defined in the previous step.
4. g(q, a) = fail for ∀q and a for which g(q, a) has not been defined in the previous steps.
5. The state that corresponds to the complete pattern is the final one.
6. The backward state-transition function h is defined by the algorithm for the computation of h above.

Outline:
• Summary of the previous lecture, searching with an SE.
• Algorithms for left-to-right search of n patterns. (AC, NFA → DFA.)
• Algorithms for left-to-right search of infinite sets of patterns.
• Regular expressions (RE).
• Direct construction of an (N)FA for a given RE.

Search of a finite set of patterns:
• An SE for left-to-right search of a set of patterns P = {v1, v2, ..., vP}.
• Instead of a repeated search of the text for every pattern, there is only "one" pass (FA).

AC search algorithm:

var text: array[1..T] of char; i: integer; found: boolean;
    state: tstate;
    g: array[1..maxstate, 1..maxsymbol] of tstate;
    h: array[1..maxstate] of tstate;
    F: set of tstate;
found := false; state := q0; i := 0;
while (i <= T) and not found do begin
  i := i + 1;
  while g[state, text[i]] = fail do state := h[state];
  state := g[state, text[i]];
  found := state in F
end

• Construction of the state-transition functions g, h?
• How about for P patterns? The main idea?
• Aho, Corasick, 1975 (AC search engine).

Construction of g for the AC SE for a set of patterns P = {v1, v2, ..., vP}:
1. An initial state q0.
2. g(q, b(j+1)) = q', where q' corresponds to the prefix b1 b2 ... b(j+1) of the pattern v^i, for ∀i ∈ {1, ..., P}.
3. For q0, we define g(q0, a) = q0 for ∀a for which g(q0, a) has not been defined in the previous steps.
4. g(q, a) = fail for ∀q and a for which g(q, a) has not been defined in the previous steps.
5. Each state that corresponds to a complete pattern is a final one.

An example: P = {he, she, her} over T = {h, e, r, s, x}, where x stands for anything else than {h, e, r, s}.
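A compact sketch of the whole AC engine (a hedged illustration with hypothetical helper names: the goto function g is built as the prefix trie just described, and the failure function is computed breadth-first in the spirit of the construction formalized on the following slides):

from collections import deque

def build_ac(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                    # phase 1: build the trie (function g)
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({}); fail.append(0); out.append(set())
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].add(p)                     # final state for the pattern p
    queue = deque(goto[0].values())       # phase 2: failure links, by depth
    while queue:
        q = queue.popleft()
        for a, r in goto[q].items():
            queue.append(r)
            s = fail[q]
            while s and a not in goto[s]:
                s = fail[s]
            fail[r] = goto[s].get(a, 0)
            out[r] |= out[fail[r]]        # inherit the matches of the suffix state
    return goto, fail, out

def ac_search(text, goto, fail, out):
    q, hits = 0, []
    for i, a in enumerate(text):
        while q and a not in goto[q]:     # backward transitions
            q = fail[q]
        q = goto[q].get(a, 0)
        for p in out[q]:
            hits.append((i - len(p) + 1, p))
    return hits

g, f, o = build_ac(["he", "she", "her"])
print(sorted(ac_search("ushers", g, f, o)))   # [(1, 'she'), (2, 'he'), (2, 'her')]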
Construction of h for the AC SE for a set of patterns P = {v1, v2, ..., vP}:

At first, we define the failure function f inductively, relative to the depth of the states, this way:
1. For ∀q of the depth 1, f(q) = q0.
2. Let us assume that f is defined for each state of the depth d and lesser. Let q be a state of the depth d with g(q, a) = q'. Then we compute f(q') as follows:
     r := f(q);
     while g(r, a) = fail do r := f(r);
     f(q') := g(r, a);

• The cycle terminates, since g(q0, a) ≠ fail for any a.
• If the states q, r represent the prefixes u, v of some of the patterns from P, then f(q) = r ⟺ v is the longest proper suffix of u.

Construction of h for the AC SE (cont.):

We could use f as the backward state-transition function h; however, redundant backward transitions would then be performed. We define the function h inductively, relative to the depth of the states, this way:
• For ∀ state q of the depth 1, h(q) = q0.
• Let us assume that h is defined for each state of the depth d and lesser. Let the depth of q be d + 1. If the set of letters for which the value of g in the state f(q) differs from fail is a subset of the set of letters for which the value of g in the state q differs from fail, then h(q) := h(f(q)), otherwise h(q) := f(q).

Finite automata for searching:

Deterministic finite automaton (DFA) M = (K, T, δ, q0, F):
1. K is a finite set of inner states.
2. T is a finite input alphabet.
3. δ is a mapping from K × T to K.
4. q0 ∈ K is an initial state.
5. F ⊆ K is a set of final states.

• A completely specified automaton: δ is defined for every pair (q, a) ∈ K × T; otherwise an incompletely specified automaton.
• A configuration of M is a pair (q, w), where q ∈ K and w ∈ T* is the not yet searched part of the text.
• The initial configuration of M is (q0, w), where w is the entire text to be searched.
• An accepting configuration of M is (q, w), where q ∈ F and w ∈ T*.
• During a transition, a single input symbol is read and the automaton switches to the next state p. A transition of M is defined by a state and an input symbol; it is the relation ⊢ ⊆ (K × T*) × (K × T*): if δ(q, a) = p, then (q, aw) ⊢ (p, w).

Nondeterministic finite automaton (NFA): δ(q, a) is now a set of states (a mapping to 2^K).
Definition: the transition ⊢ ⊆ (K × T*) × (K × T*): if p ∈ δ(q, a), then (q, aw) ⊢ (p, w) for ∀w ∈ T*.
Definition: a final state, L(M) — analogically as in the DFA.

Construction of an SE (DFA) from an NFA:
Theorem: for every nondeterministic finite automaton M = (K, T, δ, q0, F), we can build a deterministic finite automaton M' = (K', T, δ', q'0, F') such that L(M) = L(M').
A constructive proof (the subset-construction algorithm):
Input: a nondeterministic FA M = (K, T, δ, q0, F). Output: a deterministic FA M' with L(M) = L(M').
1. K' = {{q0}}; the state {q0} is unmarked.
2. If all the states in K' are marked, continue with step 4.
3. Choose an unmarked state q' from K':
   • δ'(q', a) = ∪ δ(p, a) for ∀p ∈ q' and ∀a ∈ T;
   • K' = K' ∪ {δ'(q', a) : a ∈ T}, the newly added states unmarked; mark q' and go to step 2.
4. q'0 = {q0}; F' = {q' ∈ K' : q' ∩ F ≠ ∅}.
Example: an NFA and its corresponding DFA M'.

Regular expression (RE):

Definition: a regular expression E over the alphabet A:
1. ε, ∅ are RE, and a is a RE for ∀a ∈ A.
2. If x, y are RE over A, then:
   • (x + y) is a RE (union);
   • (x.y) is a RE (concatenation);
   • (x)* is a RE (iteration).

A convention about the priority of the regular operations: union < concatenation < iteration.
Definition: thereafter, we consider as a (generalized) regular expression even those terms that do not contain, with regard to this convention, the unnecessary parentheses.

Value of a RE (the mapping h):
• h(∅) = ∅, h(ε) = {ε}, h(a) = {a};
• h(x + y) = h(x) ∪ h(y);
• h(x.y) = h(x).h(y);
• h(x*) = (h(x))*, since h(x*) = ε ∪ x ∪ x.x ∪ x.x.x ∪ ...
• The value of a RE is a regular language (RL).
• Every RL can be represented by a RE.
• For ∀ RE V ∃ FA M: h(V) = L(M).

Axiomatization of RE (Salomaa 1966):

A1   x + (y + z) = (x + y) + z     associativity of union
A2   x.(y.z) = (x.y).z             associativity of concatenation
A3   x + y = y + x                 commutativity of union
A4   (x + y).z = x.z + y.z         right distributivity
A5   x.(y + z) = x.y + x.z         left distributivity
A6   x + x = x                     idempotence of union
A7   ε.x = x                       identity element for concatenation
A8   ∅.x = ∅                       zero element for concatenation
A9   x + ∅ = x                     identity element for union
A10  x* = ε + x*x
A11  x* = (ε + x)*

Outline (Week four):
• Summary of the previous lecture.
• Regular expressions, the value of a RE, characteristics.
• Derivation of regular expressions.
• Direct construction of an equivalent DFA for a given RE by derivation.
• Derivation of regular expressions by the position vector.
• Right-to-left search (BMH, CW, BUC).

Characteristics of regular expressions:
Theorem: the axiomatization of RE is complete and consistent.
Definition: regular expressions are termed similar if they can be mutually converted using the axioms A1 to A11.
Theorem: similar regular expressions have the same value.
Definition: the length d(E) of the regular expression E:
1. If E consists of one symbol, then d(E) = 1.
2. d(V1 + V2) = d(V1) + d(V2) + 1.
3. d(V1.V2) = d(V1) + d(V2) + 1.
4. d(V*) = d(V) + 1.
5. d((V)) = d(V) + 2.
Note: the length corresponds to the syntax of the regular expression.
Construction of an NFA for a given RE:

Definition: a generalized NFA allows ε-transitions (transitions without reading an input symbol).
Theorem: for every RE E, we can create a FA M such that h(E) = L(M).
Proof: by structural induction relative to the structure of E:
• E = ε, E = ∅, E = a: elementary automata with two states and at most one transition.
• E = E1*: the automaton M1 for E1 (h(E1) = L(M1)) extended with ε-transitions realizing the iteration.
• E = E1.E2: the automata M1, M2 for E1, E2 connected serially.
• E = E1 + E2: the automata M1, M2 for E1, E2 (h(E1) = L(M1), h(E2) = L(M2)) connected in parallel.

Characteristics of the constructed NFA:
• No more than two edges come out of every state.
• No edges come out of the final states.
• The number of the states of M ≤ 2 · d(E).
• The simulation of the automaton M is performed in O(d(E) · T) time and in O(d(E)) space.

NFA simulation (cont.):
For the following methods of NFA simulation, we must remove the ε-transitions. We can achieve that with the well-known procedure. We represent a set of states with a Boolean vector and we pass through all the paths at the same time. There are two approaches:
• a general algorithm that uses a transition table;
• an implementation of the automaton in the form of a (generated) program for the particular automaton.

Direct construction of an (N)FA for a given RE:

Let E be a RE over the alphabet T. Then we create a FA M = (K, T, δ, q0, F) such that h(E) = L(M) this way:
1. We assign different natural numbers to all the occurrences of the symbols of T in the expression E. We get E'.
2. A set of starting symbols Z = {x_i : a string of h(E') can start with the symbol x_i, x_i ≠ ε}.
3. A set of neighbours P = {x_i y_j : the symbols x_i ≠ ε ≠ y_j can be next to each other in a string of h(E')}.
4. A set of ending symbols F = {x_i : a string of h(E') can end with the symbol x_i ≠ ε}.
5. A set of states K = {q0} ∪ Z ∪ {y_j : x_i y_j ∈ P}.
6. A transition function δ:
   • δ(q0, x) contains x_i for ∀x_i ∈ Z that originate from the numbering of x;
   • δ(x_i, y) contains y_j for ∀x_i y_j ∈ P where y_j originates from the numbering of y.

Construction of the Commentz-Walter (CW) search engine:
INPUT: P = {v1, ..., vk}. OUTPUT: the CW search engine.
METHOD: we construct the function g and introduce the evaluation of the individual states w:
1. An initial state q0; w(q0) = ε.
2. Each state of the search engine corresponds to a suffix b_m b_(m+1) ... b_n of a pattern v_i of the set P. Let us define g(q, a) = q', where q' corresponds to the suffix a b_m b_(m+1) ... b_n of a pattern v_i; w(q) = b_n ... b_(m+1) b_m; w(q') = w(q)a.
3. g(q, a) = fail for every q and a for which g(q, a) was not defined in step 2.
4. Each state that corresponds to a full pattern is a final one.
Right-to-left search of one pattern — example: P = {cacbaa, aba, acb, acbab, ccbab}, LMIN = 3.

Character shifts: shift(a) = 1, shift(b) = 1, shift(c) = 2, shift(x) = 4 (x = any other character).

  w         shift1   shift2
  ε         1        3
  a         1        2
  b         1        3
  aa        3        2
  ab        1        2
  bc        2        3
  ba        1        1
  aab       3        2
  aba       3        2
  bca       2        2
  bab       3        1
  aabc      3        2
  babc      3        1
  aabca     3        2
  babca     3        1
  babcc     3        1
  aabcac    3        2

(The strings w are the reversed prefixes of the patterns of P.)

Outline:
• Right-to-left search of an infinite set of patterns.
• The two-way jump automaton — a generalization of the left-to-right and right-to-left algorithms learned so far.
• The hierarchy of the exact search engines.

Right-to-left search for an infinite set of patterns:

Definition: a reversed regular expression is created by the reversal of all concatenations in the expression.
Example: the reversed RE for E = bc(a + a*bc) is E^R = (a + cba*)cb.
Buczilowski: to search for E, we create E^R and use it for the determination of shift[STATE, SYMBOL] for each state and undefined transition, analogically as in the CW algorithm:

  state   a   b        c   x
  0       1   3        •   1
  1       1   2(3!)    2   1
  3       1   1        1   4
  4       1   1        1   1
  5       1   1        1

Two-way jump automaton (2DFAS):

Definition: a 2DFAS is M = (Q, Σ, δ, q0, k, ⊥, F), where
• Q is a set of states,
• Σ is an input alphabet,
• δ is a mapping Q × Σ → Q × {−1, 1, ..., k},
• q0 ∈ Q is an initial state,
• k ∈ N is the maximal length of a jump,
• ⊥ ∉ Q ∪ Σ is a jump symbol (mark),
• F ⊆ Q is a set of final states.

Definition: a configuration of the 2DFAS is a string from Σ* Q Σ* ⊥ Σ*.
Definition: we denote the set of configurations of a 2DFAS M as K(M).
Example: a1 ... a(i−1) q a_i ... a(j−1) ⊥ a_j ... a_n ∈ K(M): the state q marks the position of the reading head (čtecí hlava), ⊥ is the jump mark (značka skoku).

Definition: a transition of the 2DFAS is a relation ⊢ ⊆ K(M) × K(M) with four kinds of steps:
• right-to-left comparison: a1 ... a(i−1) a_i q w ⊥ v ⊢ a1 ... a(i−1) q' a_i w ⊥ v, for i ≥ 1 and δ(q, a_i) = (q', −1) (the symbol a_i is compared and the head moves one position to the left);
• right-to-left jump: a1 ... a_i q w ⊥ a_j ... a_n ⊢ a1 ... a(t−1) q' ⊥ a_t ... a_n, for δ(q, a_i) = (q', m), m ≥ 1, t = min{j + m, n + 1} (the jump mark moves m positions to the right and the head jumps to it);
• left-to-right jump: a1 ... a_j q a(j+1) ... ⊥ a_i ... a_n ⊢ a1 ... a(t−1) q' ⊥ a_t ... a_n, for δ(q, a_j) = (q', m), m ≥ 1, t = min{i + m, n + 1};
• left-to-right comparison: u ⊥ ... q a_i w ⊢ u ⊥ ... a_i q' w, for δ(q, a_i) = (q', 1) (the symbol a_i is compared and the head moves one position to the right).
(The left-to-right rules are used by the left-to-right engines and vice versa.)
Definition: ⊢^k, ⊢* analogically as for the SE.

Definition: the language accepted by the two-way automaton M = (Q, Σ, δ, q0, k, ⊥, F) is the set L(M) = {w ∈ Σ* : q0 ⊥ w ⊢* u q_f x v ⊥ ..., where q_f ∈ F, u, v ∈ Σ*, x ∈ Σ}.
Theorem: L(M) for a 2DFAS M is regular.
Example: formulate the right-to-left search of the pattern BANANA in the text I-WANT-TO-FLAVOUR-NATURAL-BANANAS using BM as a 2DFAS and trace the search as a sequence of configurations of the 2DFAS.
Exercise: let us have the regular expression R = 1(0 + 1*02) over the alphabet A = {0, 1, 2}.
• Construct the right-to-left DFA for R (Buczilowski) and compute the failure function. Draw the transition graph of this automaton including the visualization of the failure function.
• Express the resulting automaton as a 2DFAS and trace the searching in the text 11201012102.

Hierarchy of the exact search engines:

  # of patterns   left-to-right   right-to-left
  1               KMP             BM
  k               AC              CW
  ∞               DFA             BUC

All of these are generalized by the 2DFAS; the axes of the hierarchy are the direction and the number of patterns.

Outline (Week five):
• Fuzzy (proximity) search. Metrics for the measurement of the distance of strings.
• Classification of search: the 6D space of search problems.
• Examples of the creation of search engines.
• Completion of the chapter about searching without text preprocessing.
• Indexing basics.

Metrics (for proximity search):

How to measure the similarity of strings?
Definition: we call d : S × S → R a metric if the following holds:
1. d(x, y) ≥ 0;
2. d(x, x) = 0;
3. d(x, y) = d(y, x) (symmetry);
4. d(x, y) = 0 ⇒ x = y (identity of indiscernibles);
5. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
We call the values of the function d distances.

Definition: let us have strings X and Y over the alphabet Σ. The minimal number of editing operations needed for the transformation of X to Y is called
• the Hamming distance (R-distance) when we allow just the operation Replace;
• the Levenshtein distance (DIR-distance) when we allow the operations Delete, Insert and Replace;
• the generalized Levenshtein distance (DIRT-distance) when we allow the operations Delete, Insert, Replace and Transpose; a transposition is possible on the neighbouring characters only.
They are metrics; the Hamming distance must be computed over strings of the same length, the Levenshtein distance can be used for strings of different lengths.

Classification of search problems:

Definition: let the text T = t1 t2 ... tn and the pattern P = p1 p2 ... pm. For example, we can ask:
1. Is P a substring of T?
2. Is P a subsequence of T?
3. Is a substring or a subsequence P in T?
4. Is P in T such that D(P, X) ≤ k for k < m, where X = t_i ... t_j is a part of T (D is the R, DIR or DIRT distance)?
5. Is a string P containing the don't care symbol * in T?
6. Is a sequence of patterns P in T?
Furthermore, there are the variants for more patterns, plus the instances of the search problem: yes/no, the first occurrence, all the overlapping occurrences, all the non-overlapping occurrences.
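For problems of type 4, the three distances have straightforward dynamic-programming implementations. A minimal sketch (hypothetical helper names; the DIRT variant uses the usual optimal-string-alignment recurrence restricted to adjacent transpositions):

def hamming(x, y):
    assert len(x) == len(y)              # the R-distance needs equal lengths
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y, transpositions=False):
    m, n = len(x), len(y)
    # d[i][j] = distance between x[:i] and y[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # Delete
                          d[i][j - 1] + 1,                           # Insert
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # Replace
            if (transpositions and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)          # Transpose
    return d[m][n]

print(hamming("abcd", "acbd"))             # 2 (R-distance)
print(levenshtein("abcd", "acbd"))         # 2 (DIR-distance)
print(levenshtein("abcd", "acbd", True))   # 1 (DIRT: one transposition)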
Examples:
• Find such an example of strings X and Y that simultaneously R(X, Y) = 5, DIR(X, Y) = 5, and DIRT(X, Y) = 5, or prove the non-existence of such strings.
• Find such an example of strings X and Y that simultaneously R(X, Y) = 5, DIR(X, Y) = 1, and DIRT(X, Y) = 3, or prove the non-existence of such strings.
• Find such an example of strings X and Y of the length 2n, n ∈ N, that R(X, Y) = 2n and a) DIR(X, Y) = 2; b) DIRT(X, Y) = n.

Classification of search problems (cont.):

  Dimension   1              2           3                4                5                 6
  Values      S (string)     F (full)    O (one)          E (exact)        C (care)          O (one)
              Q (sequence)   S (sub)     F (finite)       R (R-dist.)      D (don't care)    S (sequence)
                                         I (infinite)     D (DIR-dist.)
                                                          G (DIRT-dist.)

In total 2 × 2 × 3 × 4 × 2 × 2 = 192 search problems classified in a six-dimensional space. For example, SFO??? denotes all the SE for the search of one (entire) string. For all these problems, we are going to learn how to create an NFA for searching.

Example: let P = p1 p2 p3 p4, m = 4; A denotes any character of Σ. The NFA for SFOECO: a self-loop labeled A on the initial state and a chain of forward transitions p1, p2, p3, p4 to the final state.

Example: the NFA for QFOECO (seQuence): a chain of forward transitions p1, ..., pm, where every inner state carries a self-loop labeled with any character of Σ except the following pattern symbol. The automaton has m + 1 states for a pattern of the length m.

Search of a substring: the NFA for SSOECO.
Definition: this automaton is called the initial ε-trellis and has (m + 1) + m + (m − 1) + ... + 2 = m(m + 3)/2 states.

Example: the NFA for QSOECO is similar; we just add the self-loop cycles for the non-matching characters and ε-transitions parallel to all the existing forward transitions (or we concatenate the automaton m times).
Definition: the automaton for QSOECO is called the ε-trellis.

Proximity search — SFORCO:
Example: SFORCO for the Hamming (R) distance, k ≤ 3: the number of the level corresponds to the distance.
Definition: this automaton is called the R-trellis and has (m + 1) + m + (m − 1) + ... + (m − k + 1) = (k + 1)(m + 1 − k/2) states. The number of the level of the final state corresponds to the distance of the found string from the pattern.

Proximity search — SFODCO for the DIR-distance:
Example: SFODCO for the Levenshtein (DIR) distance, k ≤ 3: in addition to the R-trellis there are "D-edges" and "I-edges" (for the operations Delete and Insert); the pattern is p1 p2 p3 p4.

For the DIRT-distance, we add to the SFODCO automaton further new states that correspond to the operation of transposition, and also the corresponding pair of edges for every transposition.

Notes:
• The animation program by Mr. Pojer for the discussed search automata is available for download from the course web page and is also installed in B311.
• Simulation of the NFA or determinisation? A hybrid approach.
• The Prague Stringology Club and its conference series: see http://www.stringology.org/.
Outline (Week six):
• Searching with text preprocessing; indexing methods.
• Methods of indexing.
• Automatic indexing, thesaurus construction.
• Ways of index implementation.

Large amounts of texts? Then text preprocessing pays off!
• Index, indexing methods, index file, index-sequential file.
• Hierarchical structuring of text, tagging of text, hypertext.
• Questions of word list (lexicon) storing and occurrence (hit) list storing, and of their updating.

Indexing methods:
• Granularity of the index items: document — paragraph — sentence — word:

           doc1   doc2   doc3
  word1    1      0      1
  word2    1      1      0
  word3    0      1      1
  word4    1      1      1

• Inverted file: the transposition of this matrix.
• Word order (primary key) in the index → binary search. Time complexity of searching one word in the index, with n the index length and V the pattern length: O(V × log2(n)).
• Searching for k words: the pattern p = v1, ..., vk.

Corpus query examples (CQP):
• „Pravda"
• a:[word=„Žena|Muž|Člověk"] []* [lemma=a.lemma]

Basic queries:
• „Havel";
  45: Český prezident Václav (Havel) se včera na
  09: jak řekl Václav (Havel), každý občan
  248: více než rokem řekl (Havel) Pravda vítězí

Regular expressions:
• „Pravda | pravda"; „(P|p)ravda"; „(P|p)ravd[a,u,o,y]"; „pravd.*"; „pravd.+"; „post?el";

Word sequences:
• „prezident(a|u)" „Havl(a|ovi)";
• „a tak";
• „prezident" []* „Havel";
• „prezident" („republiky" „Václav")? „Havel";

Index system implementation — searching:

The large computational power of today's computers enables:
• efficient storing of large amounts of text data (compression, indexing);
• efficient search for text strings.
A man sitting behind a computer uses all this to obtain, from the documents processed this way, the information he is interested in. Really?
Example: a text database stores the last few years of a daily newspaper. I'd like to obtain information about president Václav Havel: a) >HAVEL; b) >more precise queries.
Computer (computational power) + human (intelligence) + time + money → valuable information. The goal of everybody is to transfer the largest possible part of the intelligence (time, money) to the computer.

Obtaining information from natural language — the ideal of ideals:

  level of analysis        deals with                     tools      state of the art
  pragmatic analysis       context                        —          no
  semantic analysis        sentence meaning               TIL        starting up
  syntactic analysis       grammar                        CFG, DCG   partially
  morphological analysis   word-forming, lemma            —          yes
  string searching         words as strings of letters    —          yes

(The successive levels correspond to the applications: information searching — correct translation — spell checking — simple translation.)

Do we really know what information is contained in a text in natural language?
• "František Novák weighs 100 kg." → object – property – value; in an RDB: attribute – key – value.
• "František Novák likes beer." "František Novák likes Jana Novotná."
• "F. N. is an old honest man." "Spring erupted in full force."
Words of the natural language denote objects, their properties and relations between them. It is possible to see the words and sentences also as "functions" of a kind, defined by their meaning.
• "A man who climbed the highest Czech eight-thousander is my grandson."

Corpora:
• Corpus: an electronic collection of texts, often indexed by linguistic tags.
• Corpus as a text information system: corpus linguistics.
• BNC, Penn Treebank, DESAM, PNK, ...; sizes range from millions to billions of positions (words); special methods are necessary.
• Corpus managers: CQP, GCQP, Manatee/Bonito, http://www.fi.muni.cz/~pary/ — see [MAR].

Definition: a corpus is a large, internally structured compact file of texts in natural language, electronically stored and processable.
• Indian languages have no script — for finding a grammar it is necessary to write up the spoken word.
• 1967: the first corpus in the U.S.A. (Kučera, Francis), 1,000,000 words.
• Noam Chomsky — refuses corpora.
• Today — massive expansion.

Corpus tools:
• The WWW page of Pavel Rychlý (~pary) links to the basic information. Bonito, Manatee.
• IMS CORPUS WORKBENCH — a toolkit for efficient representation and querying over large text files.

Internal architecture of a corpus:

A corpus is a sequence of words at numbered positions (first word, nth word), to which tags are added (the addition of tags is called corpus tagging). Tags are morphological, grammatical and any other information about a given word. This leads to the more general concept of position attributes; those are the most important type of tagging. Attributes of this class have a value (string) at every position of the corpus. The basic positional attribute is word, with one word linked to every position. In addition to this attribute, further positional attributes may be bundled with each position of any text, representing the morphological and other tags, e.g.:

  position:   0        1           2         3       4
  WORD:       Českého  prezidenta  Václava   Havla   dnes
  LEMMA:      český    prezident   Václav    Havel   dnes
  (and further layers POS1, POS2 with the morphological tags)

Structural attributes: sentences, paragraphs, title, article, SGML.

Two key terms of the internal representation of position attributes are:
• Uniform representation: the items of all attributes are encoded as integer numbers, where the same values get the same numeric code. A sequence of items is then represented as a sequence of integers. The internal representation of the attribute word (as well as of any other positional attribute) is array(0..p−1) of Integer, where p is the position count of the corpus.
• Inverted file: for the sequence of numbers representing the sequence of values of a given attribute, an inverted file is created. This file contains the set of occurrences in the position attribute for every value (more precisely, for every value code). The inverted file is needed for searching, because it directly yields the set of occurrences of a given item; the occurrences then can be counted in one step.

Both the file with the encoded attribute values and the inverted file have auxiliary files:
• The first data structure is the list of items, or "lexicon": it contains the set of different values of the attribute. Internally it is a set of strings occurring in the sequence of items, where the symbol Null (octal 000) is inserted behind every word. The list of items already defines a code for every item, because we suppose the first item in the list to have the code 0, the following 1, etc.
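A toy sketch of the uniform representation and the inverted file just described (illustrative only, not the CWB implementation; hypothetical helper name): the lexicon assigns the codes 0, 1, 2, ... in the order of first occurrence, and the lengths of the occurrence lists directly give the frequency table discussed below:

def encode_corpus(tokens):
    lexicon, codes = {}, []
    for t in tokens:                   # the first occurrence fixes the code
        code = lexicon.setdefault(t, len(lexicon))
        codes.append(code)
    inverted = [[] for _ in lexicon]   # occurrence list per item code
    for pos, code in enumerate(codes):
        inverted[code].append(pos)
    return lexicon, codes, inverted

lex, codes, inv = encode_corpus("to be or not to be".split())
print(lex)     # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(codes)   # [0, 1, 2, 3, 0, 1]   -- the uniform representation
print(inv)     # [[0, 4], [1, 5], [2], [3]] -- frequencies = list lengths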
Internal architecture of a corpus (cont.):

There are three data structures for the inverted file:
• The first is the independent inverted file itself, which contains the sets of corpus positions.
• The second is an index of this file. For every item code, this index returns the input point of the corresponding occurrence set in the inverted file.
• The third is a table of the frequencies of the item codes, which for each item code gives the number of its occurrences in the corpus (that is, of course, the same as the size of its occurrence set).

Searching with preprocessing:

Preprocessing of the text and the pattern (query) is used by the overwhelming majority of today's TIS. Types of preprocessing:
• n-gram statistics (fragment indexes);
• special algorithms for index processing (coding, compression) and relevance evaluation (PageRank, Google);
• usage of natural language processing methods (morphology, syntactic analysis, semantic databases) and aggregation of information from multiple sources (systems AnswerBus, START);
• signature methods.

Quality measures over the selected items:

  Accuracy = (tp + tn)/(tp + fp + fn + tn)
  Recall (Sensitivity) = tp/(tp + fn)
  Precision (Positive Predictive Value) = tp/(tp + fp)
  False Positive Rate = fp/(fp + tn)
  False Negative Rate = fn/(tp + fn)
  Specificity = tn/(tn + fp)
  Negative Predictive Value = tn/(tn + fn)

Definition: relevance (of the answers to a query) is the rate by which a selected document coincides with the requirements imposed on it. Ideal answer = real answer.
Definition: the coefficient of completeness (recall) is R = m/n, where m is the number of the selected relevant records and n is the number of all the relevant records in the TIS.
Definition: the coefficient of precision is P = m/o, where o is the number of all the records selected by the query.
We want to achieve maximal R and P — a tradeoff. Standard values: 80 % for P, 20 % for R.
Combination of completeness and precision: the coefficient F_β = (1 + β²)PR/(β²P + R). (F_0 = P; F_∞ = R; with F_1, P and R are weighted equally.)

Fragment index:
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: a fixed dictionary, no problems with updates.
• Disadvantages: dependency on the language and the thematic area, decreased precision of search.

Google as an example of a search and indexing engine:
• Jeff Dean's video — historical notes on the Google search developments.
• Google — system architecture. Google — PageRank. Google File System.
• Implementation of index systems.

An example of the anatomy of a global (hyper)text information system (www.google.com):
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are known — therefore a more detailed analysis according to the article [GOO], http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.
• Several innovative concepts: PageRank, storing of a local compressed archive, calculation of relevance from the texts of hypertext links, PDF indexing and other formats, Google File System, Google Link, ...
• The system anatomy: see [MAR].

The crucial thing is the computation of the documents' relevance (credit):
• usage of the tags of the text and the web typography for the relevance calculation of document terms;
• usage of the text of the hyperlinks referring to the document.
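The other pillar, PageRank, is defined on the next slide; as a preview, a minimal power-iteration sketch (a hedged illustration, not Google's implementation: a hypothetical three-page link graph, the damping factor d = 0.85, and the 1/m normalization of the lecture's formula):

def pagerank(links, d=0.85, iterations=50):
    m = len(links)
    pr = {p: 1.0 / m for p in links}           # start from the uniform distribution
    for _ in range(iterations):                # simple power iteration
        nxt = {p: (1 - d) / m for p in links}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * pr[p] / len(outs)    # C(p) = number of outgoing links
        pr = nxt
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical tiny web
print(pagerank(links))   # a probability distribution; the ranks sum to 1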
Fragment index
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: fixed dictionary, no problems with updates.
• Disadvantages: dependence on the language and thematic area; decreased precision of search.

Google as an example of a web-scale information system
• Jeff Dean's video: historical notes on the development of Google search.
• Google: system architecture. Google: PageRank. Google File System.
• Implementation of index systems: an example of the anatomy of a global (hyper)text information system (www.google.com).
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are published; hence a more detailed analysis, following the article [GOO], http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.
• Several innovative concepts: PageRank, storage of a local compressed archive, computation of relevance from the texts of hypertext links, indexing of PDF and other formats, Google File System, ...
• The system anatomy: see [MAR].

The crucial point is the computation of the documents' relevance (credit):
• use of text tags and web typography for computing the relevance of document terms;
• use of the text of hyperlinks referring to the document.

PageRank
• PageRank: an objective measure of page importance based on citation analysis, suitable for ordering the answers to queries (computation of page relevance).
• Let pages T_1, ..., T_n (the citations) point to a page A, let m be the total number of pages, and let C(T) be the number of links going out of page T. With a damping factor d (usually d ≈ 0.85):

PR(A) = (1 - d)/m + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

• PageRank can be computed by a simple iterative algorithm (for tens of millions of pages in hours on an ordinary PC).
• PageRank is a probability distribution over the web pages.
• PageRank is not the only factor applied; it is one coefficient among several.
• Motivation: a random surfer who follows links and, with probability 1 - d, jumps to a random page.
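The iterative computation can be sketched in a few lines of Python. The three-page link graph and the function name are invented for illustration; this is not Google's implementation:

def pagerank(links, d=0.85, iters=50):
    m = len(links)                        # total number of pages
    pr = {page: 1.0 / m for page in links}
    for _ in range(iters):                # repeated application of the formula
        pr = {page: (1 - d) / m
              + d * sum(pr[t] / len(links[t]) for t in links if page in links[t])
              for page in links}
    return pr

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}   # page -> set of outgoing links
print(pagerank(links))   # converges to a probability distribution over the pages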
Implementation of index systems (cont.)
• Storage of file signatures.
• Storage of the lexicon.
• Storage of the hit list.
• Google File System.
• Inverted file: an indexing file with a bit vector.
• Use of a document list for every key word.
• Coordinate system with pointers [MEL, fig. 4.18, page 46].
• Indexing of corpus texts: Finlib, http://www.fi.muni.cz/~pary/dis.pdf, see [MAR].
• Use of Elias coding for the compression of the hit list.
• Efficient storage of the index/dictionary (lemmas): packed trie, Patricia tree, and other tree structures.
• Syntactic neural network (S. M. Lucas: Rapid best-first retrieval from massive dictionaries, Pattern Recognition Letters 17, pp. 1507-1512, 1996).
• Commercial implementations: the Verity engine; most web search engines, with few exceptions, hide their key to success.

Dictionary representation by finite automata
Article: M. Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing, see [MAR].
• Dictionary representation by a finite automaton.
• Ambiguities; unification of minimized deterministic automata.
• Example: done -> do.V3:PP, done -> done.A0.
• A morphological dictionary as a list of pairs [word form, lemma].
• Compact storage of the automaton data structure (Liang, 1983).
• Compression ratio up to 1:20 with the linear approach (relative to the length of the word).
• Transducers for dictionary representation.
• A deterministic transducer with 1 output (a subsequential transducer) represents a dictionary with one string on the output (information about morphology, hyphenation, ...).
• A deterministic transducer with p outputs (a p-subsequential transducer) represents a dictionary with several strings on the output (ambiguities).
• Determinization of a transducer is unrealizable in general (the class of deterministic transducers with output is a proper subclass of the nondeterministic ones); for the purposes of natural language processing, though, the problem usually does not occur (there are no cycles).
• A state for (w_1, w_2) can be added to a transducer without breaking the deterministic property: first a state for (w_1, ε), then the resulting final state with output w_2.
• This method is efficient and quick, but not minimal; there are minimization algorithms that lead to spatially economical solutions.
• Procedure: splitting of the dictionary, creation of deterministic transducers with p outputs, their minimization, then a deterministic unification of the transducers and minimization of the result.
• Further uses: efficient indexing, speech recognition, etc.

Coding
• Coding. Entropy, redundancy. Universal coding of the integers. Huffman coding. Adaptive Huffman coding.

Basic notions
Definition: An alphabet A is a finite nonempty set of symbols.
Definition: A word (string, message) over A is a sequence of symbols from A.
Definition: The empty string ε is the empty sequence of symbols. The set of nonempty words over A is denoted A+.
Definition: A code K is a triple (S, C, f), where S is a finite set of source units, C is a finite set of code units, and f: S -> C+ is an injective mapping. f can be extended to S+ -> C+: f(s_1 s_2 ... s_k) = f(s_1) f(s_2) ... f(s_k). C+ is sometimes also called the code.
Definition: x ∈ C+ is uniquely decodable with respect to f if there is at most one sequence y ∈ S+ such that f(y) = x.
Definition: A code K = (S, C, f) is uniquely decodable if all strings in C+ are uniquely decodable.
Definition: A code is a prefix code if no code word is a prefix of another code word.
Definition: A code is a suffix code if no code word is a suffix of another code word.
Definition: A code is an affix code if it is both a prefix and a suffix code.
Definition: A code is full if adding any further code word yields a code that is not uniquely decodable.
Definition: A block code of length n is a code in which all code words have length n.
Example: a block code is a prefix code, but not vice versa.
Definition: A code K = (S, C, f) is binary if |C| = 2.
Definition: Compression (encoding) maps the original data to the compressed data; decompression (decoding) maps them back.
Definition: The compression ratio is the ratio of the length of the compressed data to the length of the original data.
Example: Suggest a binary prefix code for the decimal digits, given that 3 and 4 occur often and 5 and 6 rarely.

Entropy
Let Y be a random variable with probability distribution p(y) = P(Y = y). Then the mathematical expectation (mean value) is E(Y) = Σ_{y∈Y} y p(y).
Let S = {x_1, x_2, ..., x_n} be a set of source units and let the probability of occurrence of the unit x_i in the information source S be p_i, i = 1, ..., n, n ∈ N.
Definition: The entropy of the information content of the unit x_i (the measure of the amount of information, or of uncertainty) is H(x_i) = H_i = -log2 p_i bits. A source unit with higher probability carries less information.
Definition: The entropy of the information source S is H(S) = -Σ_{i=1}^n p_i log2 p_i bits.
Indeed, H(S) = Σ_y p(y) log2(1/p(y)) = E(log2(1/p(Y))).
Definition: The entropy of a source message X = x_{i1} x_{i2} ... x_{ik} over the information source S is H(X, S) = H(X) = Σ_{j=1}^k H_{ij} = -Σ_{j=1}^k log2 p_{ij} bits.
Definition: The length l(X) of the encoded message X is l(X) = Σ_{j=1}^k |f(x_{ij})| = Σ_{j=1}^k d_{ij} bits.
Theorem: l(X) ≥ H(X, S).

Redundancy
For an axiomatic introduction of entropy see [MAR]; for the details of the derivation see ftp://www.math.muni.cz/pub/math/people/Paseka/lectures/kodovani.ps.
Definition: R(X) = l(X) - H(X) = Σ_{j=1}^k (d_{ij} + log2 p_{ij}) is the redundancy of the code K for the message X.
Definition: The average length of a code word of K is AL(K) = Σ_{i=1}^n p_i d_i bits.
Definition: The average information content (entropy) of the source S is AE(S) = Σ_{i=1}^n p_i H_i = -Σ_{i=1}^n p_i log2 p_i bits.
Definition: The average redundancy of the code K is AR(K) = AL(K) - AE(S) = Σ_{i=1}^n p_i (d_i + log2 p_i) bits.
Definition: A code is optimal if it has minimal redundancy.
Definition: A code is asymptotically optimal if, for a given probability distribution, the ratio AL(K)/AE(S) approaches 1 as the entropy approaches infinity.
Definition: A code K is universal if there are constants c_1, c_2 such that AL(K) ≤ c_1 AE(S) + c_2.
Theorem: A universal code is asymptotically optimal iff c_1 = 1.
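A quick numeric check of these definitions in Python; the source distribution and the prefix code are chosen for illustration:

from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # source unit probabilities
code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a binary prefix code

AE = -sum(pi * log2(pi) for pi in p.values())          # AE(S) = -sum p_i log2 p_i
AL = sum(p[s] * len(code[s]) for s in p)               # AL(K) = sum p_i d_i
AR = AL - AE                                           # AR(K) = AL(K) - AE(S)
print(AE, AL, AR)   # 1.75 1.75 0.0: this code is optimal for this source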
Fibonacci codes
Definition: The Fibonacci sequence of order m is F_n = F_{n-m} + F_{n-m+1} + ... + F_{n-1} for n ≥ 1.
Example: order 2: F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 3, F_4 = 5, F_5 = 8, ...
Example: order 3: F_{-2} = 0, F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 7, F_5 = 13, ...
Example: order 4: F_{-3} = 0, F_{-2} = 0, F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 8, F_5 = 15, ...
Definition: The Fibonacci representation is R(N) = Σ_{i=1}^k d_i F_i, where d_i ∈ {0, 1} and d_k = 1.
Theorem: The Fibonacci representation is ambiguous in general; there is, however, one that has at most m - 1 consecutive ones in the sequence d_i.
Definition: The Fibonacci code of order m is FK_m(N) = d_1 d_2 ... d_k 1...1 (with m - 1 ones appended), where d_i are the coefficients from the previous theorem (the run of ones terminates a code word).
Example: 32 = 0·1 + 0·2 + 1·3 + 0·5 + 1·8 + 0·13 + 1·21, thus F(32) = 00101011.
Theorem: FK(2) is a prefix, universal code with c_1 = 2, c_2 = 3; hence it is not asymptotically optimal.

Elias codes of the integers
• unary code: α(N) = 0^{N-1} 1.
• binary code: β(1) = 1, β(2N + j) = β(N) j for j = 0, 1; β is not uniquely decodable (it is not a prefix code).
• ternary code: τ(N) = β(N)#.
• β'(1) = ε, β'(2N) = β'(N) 0, β'(2N + 1) = β'(N) 1; τ'(N) = β'(N)#.
• γ: every bit of β'(N) is inserted between a pair of bits of α(|β(N)|); example: γ(6) = 01001.
• C_γ = {γ(N) : N > 0} = (0{0,1})*1 is regular, and therefore decodable by a finite automaton.
• γ'(N) = α(|β(N)|) β'(N): the same length (a bit permutation of γ(N)), but more readable.
• C_γ' = {γ'(N) : N > 0} = {0^k 1 {0,1}^k : k ≥ 0} is not regular, and the decoder needs a counter.
• the recursive code ω: K := 0; while N > 1 do begin K := β(N) K; N := ⌊log2 N⌋ end.
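A sketch of the order-2 Fibonacci encoder in Python, reproducing the example F(32) = 00101011 above (a greedy Zeckendorf representation followed by the closing 1):

def fib_encode(n):
    fibs = [1, 2]                    # F_1, F_2, ... of order 2
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                       # keep only the Fibonacci numbers <= n
    bits = []
    for f in reversed(fibs):         # greedy choice: no two adjacent ones arise
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"   # d_1 d_2 ... d_k plus the closing 1

print(fib_encode(32))   # 00101011, as in the example above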
Data compression
• Encoding of information for communication purposes.
• Despite the tumultuous evolution of data storage capacities, there is still a lack of space, and access to compressed data can save time. Redundancy -> the construction of a minimally redundant code.
• Data model: structure (the set of units to compress plus the context of their occurrences) and parameters (the occurrence probabilities of the particular units). First the data model is created, then the actual encoding is performed.

History:
• 1838 Morse: coded e by frequency.
• 1949 Shannon, Fano, Weaver.
• 1952 Huffman; 5 bits per character.
• 1979 Ziv-Lempel; compress (Rodeh, Welch, Bell, Knuth, Miller, Wegman, Fiala, Greene, ...); 4 bits per character.
• eighties and nineties: PPM, DMC, gzip (zlib), SAKDC; 2-3 bits per character.
• at the turn of the millennium: bzip2; 2 bits per character.
(Figure: timeline of compression methods, 1950-2000.)

Models:
• models of order 0: probabilities of isolated source units (e.g., Morse, the character e);
• models with a finite context: Markov models, models of order n (e.g., Bach), P(a | x_1 x_2 ... x_n);
• models based on finite automata: synchronization and nonsynchronization strings; automata with a finite context; suitable for regular languages, unsuitable for context-free languages; P(a | q).

Statistical compression methods
• redundancy (non-uniform probability of occurrence of the source units);
• encoder, decoder, model;
• statistical modeling (the model does not depend on the concrete data);
• semiadaptive modeling (the model depends on the data; 2 passes; necessity of model transfer);
• adaptive modeling (only one pass; the model is built dynamically by both the encoder and the decoder).

Topics: Huffman coding. Adaptive Huffman coding. Arithmetic coding. Dictionary methods. Signature methods. Similarity of documents. Compression using neural networks.

Character techniques
• null suppression: replacement of repetitions (> 2) of the null character (255) by a special character SC;
• run-length encoding (RLE): SC X CC; a generalization to runs of any repeated character X, with CC the repetition count;
• MNP Class 5 RLE: C X X X, the count preceding three copies of the character; e.g., DDDDDBBAAAA -> 5DDDBB4AAA;
• half-byte packing (EBCDIC, ASCII), SI, SO;
• diatomic encoding: replacement of character pairs by a single character;
• Byte Pair Encoding, BPE (Gage, 1994);
• pattern substitution;
• Gilbert Held: Data and Image Compression.

Shannon-Fano coding
Input: a sequence of n source units S[i], 1 ≤ i ≤ n, in order of nondecreasing probabilities.
Output: n binary code words.

begin
  assign the empty string to all code words;
  SF-SPLIT(S)
end

procedure SF-SPLIT(S);
begin
  if |S| ≥ 2 then
  begin
    divide S into sequences S1 and S2 so that both sequences have roughly the same total probability;
    append 0 to all code words of S1;
    append 1 to all code words of S2;
    SF-SPLIT(S1); SF-SPLIT(S2)
  end
end

Properties:
• Shannon-Fano, 1949; a model of order 0.
• code words of length ⌈-log2 p_i⌉ or ⌈-log2 p_i⌉ + 1; hence AE ≤ AL ≤ AE + 1.
• code tree for (2, 2, 2, 2, 4, 4, 8).
• generally not optimal; two passes of the encoder through the text; static.

Huffman coding
• Huffman coding, 1952; static and dynamic variants.
• AL(K) = Σ_i p_i d_i.
• an optimal code (not the only possible one).
• O(n) assuming the source units are already sorted.
• a stable distribution -> the code can be prepared in advance.
• Example: (2, 2, 2, 2, 4, 4, 8).

Huffman coding: the sibling property
Definition: A binary tree has the sibling property if and only if (1) each node except the root has a sibling, and (2) the nodes can be arranged in a sequence of nondecreasing probability such that each node (except the root) is adjacent in the list to its sibling (the left sons are on the odd positions of the list, the right ones on the even positions).
Theorem: A binary prefix code is a Huffman code <=> it has the sibling property.

Huffman trees
• 2n - 1 nodes, max. 2^{n-1} possibilities.
• there are optimal binary prefix codes that are not Huffman codes.
• AR(K) < p_n + 0.086, where p_n is the maximum probability of a source unit.
• a Huffman code is full (hence poor error detection).
• it can be extended to an affix code: KWIC, left and right context, searching for *X*.
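For illustration, a compact static Huffman construction in Python over the example frequencies (2, 2, 2, 2, 4, 4, 8); ties are broken arbitrarily, so this produces one of several optimal codes:

import heapq
from itertools import count

def huffman(freqs):
    tick = count()                    # unique tie-breaker for equal weights
    heap = [(w, next(tick), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:              # repeatedly merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):           # assign 0/1 along the tree paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = dict(zip("abcdefg", (2, 2, 2, 2, 4, 4, 8)))
print(huffman(freqs))   # AL(K) = sum p_i d_i is minimal for this source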
Adaptive Huffman coding
• FGK (Faller, Gallager, Knuth).
• suppression of the past by a forgetting coefficient (rounding of the occurrence counts).
• linear time of coding and decoding with respect to the length of the word.
• AL_HD ≤ 2 AL_HS; Vitter: AL_HD ≤ AL_HS + 1 (dynamic versus static Huffman coding).
• implementation details: tree representation, code tables.

Arithmetic coding
• a generalization of Huffman coding (the probabilities of the source units need not be negative powers of two).
• ordering of the source units; the cumulative probability cp_i = Σ_{j=1}^{i-1} p_j of the source unit x_i with probability p_i.
• Advantages: arbitrary proximity to the entropy; adaptivity is possible; speed.

Dictionary methods of data compression
Definition: A dictionary is a pair D = (M, C), where M is a finite set of words of the source language and C is a mapping of M to the set of code words.
Definition: L(m) denotes the length of the code word C(m) in bits, for m ∈ M.
Selection of source units:
• static (agreement on the dictionary in advance);
• semiadaptive (two passes through the text are necessary);
• adaptive.

Source units of length n: n-grams
Most often bigrams (n = 2).
• n fixed;
• n variable (by frequency of occurrence);
• adaptive (50% of an English text consists of about 150 most frequent words).
Disadvantages:
• inability to react to the probability distribution of the compressed data;
• a pre-prepared dictionary.

Semiadaptive dictionary methods: the dictionary creation procedure
(Transferred output: dictionary + compressed data, or compressed dictionary + compressed data.)
Advantage: suitable for extensive data, since the dictionary is only a small part of the data (corpora, CQP).
1. The frequency of N-grams is determined for N = 1, 2, ...
2. The dictionary is initialized by inserting the unigrams.
3. The N-grams with the highest frequency are gradually added to the dictionary. When a K-gram is inserted, the frequencies of its component (K-1)-grams, (K-2)-grams, ... are decreased. If the frequency of a component is thereby greatly reduced, the component is excluded from the dictionary.

Outline (week thirteen)
• Adaptive dictionary methods with dictionary restructuring.
• Syntactic methods.
• Checking of text correctness.
• Querying and TIS models.
• Vector model of documents.
• Automatic text structuring.
• Document similarity.

LZ77: sliding-window methods (LZ78, methods with an increasing dictionary, follow later)

a b c b a b b a | a b a c b
encoded part (the window, N ≤ 8192) | not yet encoded part (|B| ~ 10-20 characters)

In the encoded part, the longest prefix P of the not yet encoded part is searched for. If such a string S is found, P is encoded as (i, j, a), where i is the distance of the first character of S from the window border, j is the length of S, and a is the first character behind the prefix P. The window is then shifted j + 1 characters to the right. If no such string is found, the triple (0, 0, a) is produced, where a is the first character of the not yet encoded part.
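A didactic Python sketch of the (i, j, a) scheme just described. The window sizes are toy values, and the naive linear search stands in for the KMP-based prefix search mentioned below:

def lz77_encode(text, N=16, B=8):
    out, pos = [], 0
    while pos < len(text):
        start = max(0, pos - (N - B))            # left border of the window
        best_i = best_j = 0
        for i in range(start, pos):              # search the encoded part
            j = 0
            while (j < B - 1 and pos + j < len(text) - 1
                   and text[i + j] == text[pos + j]):
                j += 1                           # extend the match (may overlap pos)
            if j > best_j:
                best_i, best_j = pos - i, j      # distance from the border, length
        a = text[pos + best_j]                   # first character behind the prefix
        out.append((best_i, best_j, a))
        pos += best_j + 1                        # shift the window by j + 1
    return out

print(lz77_encode("abcbabbaabacb"))   # [(0, 0, 'a'), (0, 0, 'b'), ...]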
LZ77 (cont.)
• |M| = (N - B) × B × t, where t is the size of the alphabet.
• L(m) = ⌈log2(N - B)⌉ + ⌈log2 B⌉ + ⌈log2 t⌉.
• Advantage: the search for the longest prefix can use [KMP].

LZR
• LZR uses a tree containing all the prefixes of the already encoded part.
• The whole already encoded part is used as the dictionary.
• Because the i in (i, j, a) can be large, the Elias code for the coding of the integers is used.
• Disadvantage: the tree grows without any limitation, so after a defined memory size is exceeded, it is deleted and the construction starts from scratch.

LZSS (Storer, Szymanski)
• The code is a sequence of pointers and characters. A pointer (i, j) needs as much memory as p characters, so a pointer is used only when it pays off, and one extra bit is needed to distinguish a character from a pointer.
• The number of dictionary items is |M| = t + (N - B) × (B - p) (considering only substrings longer than p).
• The number of bits needed to encode m: L(m) = 1 + ⌈log2 t⌉ for m ∈ T; L(m) = 1 + ⌈log2 N⌉ + ⌈log2(B - p)⌉ otherwise (the length j of a substring can be represented as j - p).

LZB
• A pointer (i, j), in analogy to LZSS. If the window is not yet full (at the beginning) and the compressed text is shorter than N, the use of log2 N bits for encoding i is wasteful.
• LZB therefore phases in the binary coding: a prefix code with an increasing number of bits for increasing values. For j, the Elias code γ is used.

LZH
• LZSS in which the pointers are encoded by Huffman coding (i.e., by the distribution of their probabilities => two passes).

LZ78: methods with an increasing dictionary
The main idea: the dictionary contains phrases. A new phrase is created by extending an already existing phrase by one symbol. A phrase is encoded by the index of its prefix and by the added symbol.

Input:   a      b      ab     c      ba     bab    aa     aaa    aaaa
Index:   1      2      3      4      5      6      7      8      9
Output:  (0,a)  (0,b)  (1,b)  (0,c)  (2,a)  (5,b)  (1,a)  (7,a)  (8,a)
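A minimal LZ78 encoder in Python that reproduces the beginning of the example table above (index 0 stands for the empty prefix):

def lz78_encode(text):
    dic, out, w = {}, [], ""
    for ch in text:
        if w + ch in dic:
            w += ch                          # extend the current phrase
        else:
            out.append((dic.get(w, 0), ch))  # (index of the prefix, added symbol)
            dic[w + ch] = len(dic) + 1       # the new phrase gets the next index
            w = ""
    if w:
        out.append((dic[w], ""))             # flush a pending phrase at the end
    return out

print(lz78_encode("ababcbabab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (2, 'a'), (5, 'b')]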
LZFG (Fiala, Greene)
• The dictionary is stored in a tree structure; the edges are labeled with strings of characters. These strings are in the window, and each node of the tree contains a pointer into the window identifying the symbols on the path from the root to the node.

LZW
• Only the indexes are output.
• The dictionary is initialized with items for all the input symbols.
• The last symbol of each phrase is the first symbol of the following phrase.

Input:     a  b  ab  c  ba  bab  a   aa  aaa
New index:    4  5   6  7   8    9   10  11
Output:    1  2  4   3  5   8    1   10  11

LZC
• LZW as used by the program compress: on dictionary overflow, no further phrases are added and coding continues statically.
• Pointers are encoded with increasing length.
• Once the compression ratio starts to decrease, the dictionary is deleted and built again from scratch.

LZT
• As LZC, but when the dictionary overflows, the phrases least used in the recent past are excluded from the dictionary. It phases in the binary coding of the phrase indexes.

LZMW
• As LZT, but a new phrase is not created by appending one character to the previous phrase; instead, the new phrase is the concatenation of the two most recently encoded ones.

LZJ
• Another principle of dictionary construction: at the beginning only the single symbols are inserted; the dictionary is stored in a tree and contains all the substrings of the processed string up to length h.
• A full dictionary => either a static procedure, or omitting of the nodes with a low usage frequency.

Dictionary restructuring
• Ongoing reorganization of the source units -> shorter code strings.
• Variants of heuristics: count of occurrences; move to the front (BSTW); exchange with the previous item; move X positions forward.
• BSTW advantage: high locality of occurrences of a small number of source units.
• Example: I'm not going to the forest, ...; 1^n 2^n ... k^n.
• Generalization: the recency coefficient; interval coding.

Interval coding: a word is represented by the number of words since its last occurrence. The dictionary contains the words a_1, a_2, ..., a_n; the input sequence is x_1, x_2, ..., x_m. The value LAST(a_i), holding the position of the last occurrence, is initialized to zero.

for t := 1 to m do
begin {x_t = a_i}
  if LAST(x_t) = 0
    then y(t) := t + i - 1
    else y(t) := t - LAST(x_t);
  LAST(x_t) := t
end

The sequence y_1, y_2, ..., y_m is the output of the encoder and can be encoded by a single code of variable length.

Syntactic methods
• The grammar of the message language is known.
• The left parse of the derivation tree of the string.
• Global numbering of the rules.
• Local numbering of the rules.
• The decision states of an LR analyzer are encoded.

Context modeling
• fixed context: a model of order N;
• combined approach: contexts of various lengths, with fixed or variable weights;
• time and memory consuming;
• assignment of a probability to a new source unit: the escape probability e;
• automata with a finite context;
• dynamic Markov modeling.

Checking the correctness of the text
• Checking of the text using a frequency dictionary.
• Checking of the text using a double frequency dictionary.
• Interactive checking of the text (ispell).
• Checking of the text based on the regularity of words: the weirdness coefficient.

The weirdness coefficient of the trigram xyz is
KPT = [log(f(xy) - 1) + log(f(yz) - 1)]/2 - log(f(xyz) - 1),
where f(xy) and f(xyz) are the relative frequencies of the bigram and the trigram, respectively, and log(0) is defined as -10.
The weirdness coefficient of a word is
KPS = sqrt((1/n) Σ_{i=1}^n (KPT_i - ØKPT)^2),
where KPT_i is the weirdness coefficient of the i-th trigram and ØKPT is the mean of the weirdness coefficients of all the trigrams contained in the word.
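A direct transcription of the two coefficients into Python; base-10 logarithms are assumed here, and the frequency values are invented placeholders:

from math import log10, sqrt

def safe_log(x):                 # log(0) is defined as -10
    return log10(x) if x > 0 else -10.0

def kpt(f_xy, f_yz, f_xyz):      # weirdness coefficient of one trigram xyz
    return (safe_log(f_xy - 1) + safe_log(f_yz - 1)) / 2 - safe_log(f_xyz - 1)

def kps(kpts):                   # weirdness of a word from its trigram coefficients
    mean = sum(kpts) / len(kpts)
    return sqrt(sum((k - mean) ** 2 for k in kpts) / len(kpts))

trigram_kpts = [kpt(120, 80, 15), kpt(40, 60, 2), kpt(5, 3, 1)]
print(trigram_kpts, kps(trigram_kpts))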
Outline (week fourteen)
• Querying and TIS models. Boolean model of documents. Vector model of documents. TIS architecture. Signature methods. Similarity of documents.
• Vector model of documents (completion). Extended boolean model. Probabilistic model. Model of document clusters. TIS architecture. Automatic text structuring. Similarity of documents. Lexicon storage. Signature methods.

Querying
Different methods of hierarchization and document storage give different possibilities and different efficiency of querying:
• Boolean model, SQL.
• Vector model.
• Extended boolean types.
• Probabilistic model.
• Model of document clusters.

Semantic querying
The search lies in reducing the uncertainty of a question:
1. We find a document with high relevance.
2. We start to query with its key words.
3. We remove descriptors, or replace them with disjunctions.
The system http://infomap.stanford.edu works with the searched meaning/concept (as opposed to mere strings of characters). The right query formulation is half of the answer. The search lies in the determination of the semantically closest terms.

Boolean model
• Fifties: representation of documents by sets of terms, and querying based on the evaluation of boolean expressions.
• A query expression is built inductively from the primitives: a term; attribute_name = attribute_value (comparison); function_name(term) (application of a function); and from these using parentheses and the logical connectives X and Y, X or Y, X xor Y, not Y.
• disjunctive normal form, conjunctive normal form;
• proximity operators;
• regular expressions;
• thesaurus usage;
• boolean operators and, or, xor, not;
• positional operators adj, (n) words, with, same, syn;
• SQL extension: operations/queries using a thesaurus: BT(A) broader term, NT(A) narrower term, PT(A) preferred term, SYN(A) synonyms of the term A, RT(A) related term, TT(A) top term.

ORACLE SQL*TEXTRETRIEVAL
SELECT specification_of_items FROM specification_of_tables WHERE item CONTAINS text_expression
Example:
SELECT TITLE FROM BOOK WHERE ABSTRACT CONTAINS 'TEXT' AND RT(RETRIEVAL)
Text expressions: 'string', 'string'*, *'string', 'st?ing', 'str%ing', 'stringa' (m,n) 'stringb', 'multiword phrases', BT('string',n), BT('string',*), NT('string',n).
Example:
SELECT NAME FROM EMPLOYEE WHERE EDUCATION CONTAINS RT(UNIVERSITA) AND LANGUAGES CONTAINS 'ENGLISH' AND 'GERMAN' AND PUBLICATIONS CONTAINS 'BOOK' OR NT('BOOK',*)

The association coefficient between queries:
asoc(Q_A, Q_B) = log( ((fN - AB - N/2)^2 N) / (AB(N - A)(N - B)) ),
where A is the number of documents hit by the query Q_A, B is the number of documents hit by the query Q_B (whose relevance we compute), f is the number of documents hit by both queries, and N is the total number of documents in the TIS.
• cutoff (relevant/irrelevant);
• clustering/nesting;
• 1st generation, 2nd generation, ...

Vector model of documents
Let a_1, ..., a_n be terms, D_1, ..., D_m documents, and W = (w_ij) a relevance matrix of type m × n, w_ij ∈ ⟨0, 1⟩ (0 irrelevant, 1 relevant). A query is Q = (q_1, ..., q_n).
• S(Q, D_j) = Σ_i q_i w_ij is the similarity coefficient.
• head(sort(S(Q, D_j))) is the answer.
CONS: it does not take "and" and "or" into account.
PROS: possible improvements:
• normalization of the weights;
• term frequency TF; inverted document frequency IDF = log2(m/k), where k is the number of documents containing the term;
• discrimination value of the terms;
• normalization of the weights for a document and for a query [POK, pages 85-113];
• w_ij = (TF_ij / max TF) · log2(m/k).

Automatic text structuring
• Interrelations between the documents in a TIS.
• Encyclopedias (OSN, Funk and Wagnalls New Encyclopedia).
• [SBA] http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs
• Google/CiteSeer: "automatic structuring of text files".

Document similarity
• Most often the cosine measure, for good reasons. For a detailed overview of similarity functions see chapter 5.7 of [KOR].
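A small end-to-end sketch in Python: the TF-IDF weighting w_ij above combined with the cosine measure for ranking. The documents and all names are toy examples:

from math import log2, sqrt

docs = ["text retrieval system", "vector model of documents",
        "boolean retrieval model"]
terms = sorted({t for d in docs for t in d.split()})
m = len(docs)

def weight(term, doc):                 # w_ij = (TF_ij / max TF) * log2(m/k)
    words = doc.split()
    tf = words.count(term)
    if tf == 0:
        return 0.0
    max_tf = max(words.count(t) for t in words)
    k = sum(term in d.split() for d in docs)   # number of documents with the term
    return (tf / max_tf) * log2(m / k)

W = [[weight(t, d) for t in terms] for d in docs]   # the relevance matrix W

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = sqrt(sum(qi * qi for qi in q)) * sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

query = [1.0 if t in ("retrieval", "model") else 0.0 for t in terms]
print(sorted(((cosine(query, W[j]), j) for j in range(m)), reverse=True))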