PV030 Textual Information Systems
Petr Sojka, Faculty of Informatics, Masaryk University, Brno, Spring 2012

Part I
Information about the course PV030

• Petr Sojka, sojka@fi.muni.cz
• Consulting hours in Spring 2012: Wednesday 13:00–13:45, Friday 10:00–11:50, or write an email to arrange another time to meet.
• Room C523/522, fifth floor of block C, Botanická 68a.
• Course homepage: http://www.fi.muni.cz/~sojka/PV030/
• Seminar (Thu 12:00–12:50, C511 -> B511).

Topics and classification of the course
Prerequisites: students are expected to have basic knowledge of the theory of automata and formal languages (IB005) and elementary knowledge of the theories of complexity, software programming, and systems.
Classification: the classification is based on a written mid-term (30 points) and a final (70 points) exam. In addition, one can get premium points for activities during the lectures. The classification scale (changes reserved) z/k/E/D/C/B/A corresponds to obtaining 48/54/60/66/72/78/84 points. Dates of the final exams will be announced via IS.muni.cz (probably three terms).

"My books focus on timeless truth." (D. E. Knuth, Brno, 1996)

An emphasis will be given to the explanation of basic principles, algorithms, and (software) design techniques for the creation and implementation of textual information systems (TIS): storage and information retrieval.

Course syllabus
1. Basic notions. TIS (text information system). Classification of information systems. From texts to Watson.
2. Searching in TIS. Searching and pattern matching: classification and data structures.
3. Algorithms of Knuth-Morris-Pratt, Aho-Corasick; regular expressions.
4. Algorithms of Boyer-Moore, Commentz-Walter, Buczilowski.
5. Theory of automata for searching. Classification of searching problems. Searching with errors.
6. Indexes. Indexing methods. Data structures for searching and indexing.
7. Google as an example of a search and indexing engine. PageRank. Signature methods.
8. Query languages and document models: Boolean, vector, probabilistic, MMM, Paice.
9. Data compression. Basic notions. Entropy.
10. Statistical methods.
11. Compression methods based on dictionaries.
12. Syntactic methods. Context modeling. Language modeling. Corpus linguistics.
13. Spell checking. Filtering of information channels. Document classification. Neural nets for text compression.

Literature
[MEL] Melichar, B.: Textové informační systémy, skripta ČVUT Praha, 2. vydání, 1996.
[POK] Pokorný, J., Snášel, V., Húsek, D.: Dokumentografické informační systémy, Karolinum Praha, 1998.
[KOR] Korfhage, R. R.: Information Storage and Retrieval, Wiley Computer Publishing, 1997.
[SMY] Smyth, B.: Computing Patterns in Strings, Addison Wesley, 2003.
[KNU] Knuth, D. E.: The Art of Computer Programming, Vol. 3, Sorting and Searching, Second edition, 1998.
[WMB] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images, Second edition, Morgan Kaufmann Publishers, 1998.
[HEL] Held, G.: Data and Image Compression, Tools and Techniques, John Wiley & Sons, 4. vydání, 1996.
[MEH] Melichar, B., Holub, J.: A 6D Classification of Pattern Matching Problems, Proceedings of The Prague Stringology Club Workshop '97, Prague, July 7, 1997.
[GOO] Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
WWW7 / Computer Networks 30(1–7): 107–117 (1998). http://dbpubs.stanford.edu:8090/pub/1998-8
[MeM] Mehryar Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61–80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
[Sch] Schmidhuber, J.: Sequential Neural Text Compression, IEEE Transactions on Neural Networks 7(1), 142–146, 1996. http://www.idsia.ch/~juergen/onlinepub.html
[SBA] Salton, G., Buckley, Ch., Allan, J.: Automatic Structuring of Text Files, Electronic Publishing 5(1), pp. 1–17 (March 1992). http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs.htm
[WWW] Web pages of the course ~sojka/PV030/, DIS seminars http://www.inf.upol.cz/dis, http://nlp.fi.muni.cz/, The Prague Stringology Club Workshops 1996–2008, http://cs.felk.cvut.cz/psc/
[JW] Jones, S. K., Willett: Readings in Information Retrieval, Morgan Kaufmann Publishers, 1997.
[BCW] Bell, T. C., Cleary, J. G., Witten, I. H.: Text Compression, Prentice Hall, Englewood Cliffs, NJ, 1990.
[STO] Storer, J.: Data Compression: Methods and Theory, Computer Science Press, Rockville, 1988.
Journals: ACM Transactions on Information Systems, Theoretical Computer Science, Neural Network World, ACM Transactions on Computer Systems, Knowledge Acquisition. See also knihovna.muni.cz, umarecka.cz (textbook Pokorný).

Part II
Basic notions of TIS

reality <—> data
information need <—> query

• Abstractions and mappings in information systems.
• Information needs about the reality: queries above data.
• The Jeopardy game: Watson.
Notions of (T)IS; PV030 in the context of teaching at FI MU

Definition: An information system is a system that allows purposeful arrangement of the collection, storage, processing, and delivery of information.
Definition: The ectosystem consists of the users of the IS, the investor of the IS, and the entrepreneur (user, funder, server). In the example of is.muni.cz, these are the users of the IS, MU represented by the bursar, and the ICS and IS teams. The ectosystem is not under the control of the IS designer.
Definition: The endosystem consists of the hardware used (media, devices) and the software (algorithms, data structures), and it is under the control of the IS designer.

Views of an IS:
• effectiveness (user);
• economics (funder);
• efficiency (server);
and the compromises implied by these different preferences. Our view will be that of a TIS architect respecting the requests of the IS ectosystem. For topics related to the ectosystem of an IS, see PV045 Management IS.

From data to wisdom
• Data: a concrete representation of a message in the form of a sequence of symbols of an alphabet.
• Information: a reflection of the known or the expected substance of realities. Information depends on the intended subject. Viewpoints:
  • quantitative (information theory);
  • qualitative (meaning, semantics);
  • pragmatic (valuation: significance, usefulness, usability, periodicity, up-to-dateness, credibility);
  • others (promptness, particularity, completeness, univocality, availability, costs of obtaining).
• Knowledge (znalost).
• Wisdom (moudrost).
Definition: An information process is the process of the formation of information, its representation in the form of data, and its processing, provision, and use. Operations with information correspond to this process.
Data/signals → Information → Knowledge → Wisdom.

IS classification by the prevailing function
1. Information retrieval systems.
2. Database management systems (DBMS), relational DB (PB154, PB155, PV003, PV055, PV136, PB114).
3. Management information systems (PV045).
4. Decision support systems (PV098).
5. Expert systems, question answering systems, knowledge-based systems (PA031).
6. Information service systems (web 2.0).
7. Specific information systems (geographical PV019, PA049, PA050; medical PV048; environmental PV044; corporate PV043; state administration PV058, PV059; librarian PV070); and also PV063 Application of database systems.

Related fields taught at FI: Software engineering (PA102, PA105). Similarity searching in multimedia data (PA128). Efficient use of database systems (PA152). Introduction to information retrieval (PV211).

Mini questionnaire
1. What do you expect from this course? What was your motivation to enroll? Is the planned syllabus fine? Any changes or surprises?
2. What do you not expect (what would you rather eliminate)?
3. Which related courses have you already passed?
4. Practising IS usage (as a user): a) Which (T)IS do you use? b) With what intensity and frequency? How many searches per month? c) Are you satisfied with it?
5. IS creation (server): a) Which (T)IS or its components have you realized? In which area, of what size? b) Are you satisfied with it? What were the bottlenecks?

Classification and formalization of IRS

Information retrieval systems (IRS): principles
Query → search engine (over a DB of documents) → set of selected documents.
The search engine is determined by the definition of the documents, the definition of the output file format, the definition of the search method, and the content of the documents; queries are evaluated against it to produce the set of selected documents.

Searching: formalization of the problem
Concatenation: a string of beads. A bead → an element. Elements are indexed by natural numbers (or, more generally, by labels):
0) Every element has a unique label.
1) Every labeled element x (except for the leftmost one) has a unique predecessor, referred to as pred(x).
2) Every labeled element x (except for the rightmost one) has a unique successor, referred to as succ(x).
3) If the element x is not the leftmost one, then x = succ(pred(x)).
4) If the element x is not the rightmost one, then x = pred(succ(x)).
5) For every two different elements x and y, there exists a positive number k such that either x = succ^k(y) or x = pred^k(y).

The concatenation term:
Definition: a string is a set of elements that meets the rules 0)–5).
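As an illustration of ours (not from the lecture), the axioms 3) and 4) can be checked directly on a doubly linked representation of a string in C; the names Elem, pred, and succ below are our choices:

```c
#include <stddef.h>

/* Hypothetical sketch: an element ("bead") with explicit predecessor and
   successor links; the struct and field names are ours. */
typedef struct Elem {
    char label;
    struct Elem *pred, *succ;
} Elem;

/* Build the linear string a-b-c and verify axioms 3) and 4) for the
   middle element: x = succ(pred(x)) and x = pred(succ(x)). */
int axioms_hold(void) {
    static Elem a = {'a', NULL, NULL}, b = {'b', NULL, NULL},
                c = {'c', NULL, NULL};
    a.succ = &b; b.pred = &a;   /* a <-> b */
    b.succ = &c; c.pred = &b;   /* b <-> c */
    return (&b == b.pred->succ) && (&b == b.succ->pred);
}
```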
Definition: a linear string is a string that has finitely many elements, including the leftmost and the rightmost ones.
Definition: a necklace (a circular string).
Definition: an alphabet A; the letters of the alphabet; A+; the empty string ε.
Definition: A* = A+ ∪ {ε}.
Definition: a linear string over A is a member of A+.
Definition: a pattern. A text.

Classification of searching:
1. According to the passing direction: left-to-right / right-to-left.
2. According to the (pre)processing of the text and the pattern:
• ad fontes (searching in the text itself);
• text surrogate (searching in a substitute of the text);
• substitutes: an index (an ordered list of significant elements together with references to the original text); a signature (a string of indicators that shows the occurrence of significant elements in the text).

IRS classification (cont.)

                          text preprocessing
                          no    yes
pattern preprocessing no   I    III
                      yes  II   IV

I – elementary algorithms; II – creating a search engine; III – indexing methods; IV – signature methods.

Searching: the formulation of the problem
Classification according to the cardinality of the set of patterns:
1. Search for a single pattern v in the text t. The result: yes/no.
2. Search for a finite set of patterns P = {v1, v2, ..., vk}. The result: information about the positions of some of the entered patterns.
3. Search for an infinite set of patterns given by a regular expression R. R defines a potentially infinite set L(R). The result: information about the positions of some of the patterns from L(R).
Alternatives to the formulation of the searching problem:
a) the first occurrence;
b) all occurrences without overlapping;
c) all occurrences including overlapping.

Part III
Exact search

I. Search without preprocessing of either the patterns or the text

Rudimentary search algorithm
Naive search, brute-force search:

proc Brute-Force-Matcher(PATTERN, TEXT):
  T := length[TEXT]; P := length[PATTERN];
  for i := 0 to T-P do
    if PATTERN[1..P] = TEXT[i+1..i+P] then
      print "The pattern was found at the position i.";

Time complexity analysis of the naive search
• The complexity is measured by the number of comparisons, with the length of the pattern P and the length of the text T.
• The upper estimate is S = P · (T − P + 1), thus O(P × T).
• The worst case: PATTERN = a^(P−1)b, TEXT = a^(T−1)b.
• Natural languages: the (average) complexity (number of comparisons) is substantially smaller, since equality of prefixes does not occur very often. For English, S = C_E · (T − P + 1), with C_E empirically measured as 1.07, i.e., practically linear.
• What is C_Cz? C_Cz vs. C_E?
• Any speedups? An application to several patterns? To an infinite number of patterns?
• We will see another version of the algorithm in the seminar.

Exercise: express the time complexity of the following search algorithms using the variables c and s, where c is the number of the tests and these statements hold:
• if the index i is found, then c = i and s = 1;
• otherwise, c = T and s = 0.
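The Brute-Force-Matcher pseudocode above can be sketched as a runnable C function (a sketch of ours, with 0-based indexing; the position-wise comparison is done by memcmp):

```c
#include <string.h>

/* Naive (brute-force) search: try every position i of the text and
   compare the whole pattern there. Worst case O(P * T) comparisons,
   as in the analysis above. Returns the first 0-based position of the
   pattern in the text, or -1 if there is no occurrence. */
int brute_force_match(const char *text, const char *pattern) {
    int T = (int)strlen(text), P = (int)strlen(pattern);
    for (int i = 0; i <= T - P; i++)
        if (memcmp(text + i, pattern, (size_t)P) == 0)
            return i;               /* the pattern was found at position i */
    return -1;                      /* no occurrence */
}
```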
input: var TEXT: array[1..T] of word; PATTERN: word;
output (in the variable FOUND): yes/no

I := 1;                              {1}
while I <= T do begin                {c}
  if TEXT[I] = PATTERN then break;   {c}
  inc(I);                            {c-s}
end;
FOUND := (I <= T);                   {1}

Higher effectiveness of the algorithms is needed.

II. Exact search with query preprocessing
During the preprocessing, the structure of the query pattern(s) is examined and, on that basis, the search engine is built (on the fly).
Definition: exact (vs. fuzzy (proximitní)) search aims at an exact match (localization of the searched pattern(s)).
Definition: left-to-right (LR, sousměrné) (vs. right-to-left (RL, protisměrné)) search compares the query pattern to the text from left to right (vs. from right to left).

1. One query pattern (vzorek):
• the Shift-Or algorithm;
• the Karp-Rabin algorithm (KR, 1987);
• the Knuth-Morris-Pratt algorithm (KMP, designed (MP) in 1970, published in 1977).
2. n patterns: the Aho-Corasick algorithm (AC, 1975).
3. ∞ patterns: construction of a search engine (finite automaton) for the search of a potentially infinite set of patterns (given as a regular expression).

The Shift-Or algorithm
• A pattern v1 v2 ... vm over an alphabet Σ = {a1, ..., ac}.
• An incidence matrix X (m × c), X_ij = 0 if v_i = a_j, 1 otherwise.
• The column of X corresponding to a_j is named A_j.
• At the beginning, we put a unit vector (all ones) into the state column R. In every step of the algorithm, R moves down by one position, the topmost position is filled with zero, and one character a_j is read from the input. The resulting R is combined with A_j by binary disjunction: R := SHIFT(R) OR A_j.
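The step R := SHIFT(R) OR A_j maps directly onto machine words. The following C sketch (ours) stores each column A_j as a bit mask, with bit i equal to 0 iff v_{i+1} matches the symbol, and assumes the pattern fits into one unsigned long:

```c
#include <limits.h>
#include <string.h>

/* Bit-parallel Shift-Or. mask[a] plays the role of the column A_j of the
   incidence matrix X (a 0 bit means a match); R is the state column.
   Returns the 0-based position of the first occurrence, or -1. */
int shift_or_match(const char *text, const char *pattern) {
    int T = (int)strlen(text), P = (int)strlen(pattern);
    if (P == 0 || P > (int)(sizeof(unsigned long) * CHAR_BIT))
        return -1;                      /* pattern must fit in one word */
    unsigned long mask[256];
    for (int c = 0; c < 256; c++)
        mask[c] = ~0UL;                 /* 1 = mismatch everywhere */
    for (int i = 0; i < P; i++)
        mask[(unsigned char)pattern[i]] &= ~(1UL << i);  /* 0 = match */
    unsigned long R = ~0UL;             /* the initial all-ones vector */
    for (int i = 0; i < T; i++) {
        R = (R << 1) | mask[(unsigned char)text[i]];  /* R := SHIFT(R) OR A_j */
        if ((R & (1UL << (P - 1))) == 0)              /* 0 at the bottom */
            return i - P + 1;
    }
    return -1;
}
```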
• The algorithm stops successfully when 0 appears at the bottommost position of R.

Example: v = vzorek over Σ = {e, k, o, r, v, z}. Cf. [POK, pages 31–32].

The Karp-Rabin search algorithm
A quite different approach: the usage of a hash function. Instead of matching the pattern against the text at every position, we check for a match only when the pattern "looks similar" to the searched text substring. For the similarity check, a hash function is used. It has to be
• efficiently computable,
• and good at separating different strings (close to perfect hashing).
KR search is quadratic in the worst case, but O(T + V) on average.

Implementation (in C; the text is y of length n, the pattern is x of length m):

#define REHASH(a, b, h) ((((h) - (a)*d) << 1) + (b))

void KR(char *y, char *x, int n, int m) {
  int hy, hx, d, i;
  /* preprocessing: computation of d = 2^(m-1) */
  d = 1;
  for (i = 1; i < m; ++i)
    d <<= 1;
  /* initial hash values of the pattern and of the first window */
  for (hy = hx = i = 0; i < m; ++i) {
    hx = (hx << 1) + x[i];
    hy = (hy << 1) + y[i];
  }
  /* searching: roll the window hash along the text */
  for (i = m; i <= n; ++i) {
    if (hx == hy && memcmp(x, y + i - m, m) == 0)
      printf("The pattern was found at the position %d.\n", i - m);
    if (i < n)
      hy = REHASH(y[i - m], y[i], hy);
  }
}

The KMP algorithm (main part):

i := 1;  { text index }
j := 1;  { pattern index }
while (i <= T) and (j <= V) do
begin
  while (j > 0) and (text[i] <> pattern[j]) do j := h[j];
  i := i + 1; j := j + 1
end;
found := j > V;  { if found, it is at the position i - V }

(K)MP search engine
• O(T) complexity plus the complexity of the preprocessing (the creation of the array h).
• Animation of the tracing of the main part of KMP.
is" h(j) the biggest k < j such that 1/2 • • • ^-1 's suffix of ^2 • • • ^-1, e.g. ^2 • • • ^-1 = Vj-k+2 ■ ■ ■ Vj-A and Vj ± vk. is" KMP: backward transitions for so long, so that j = 0 (prefix of pattern is not contained in the searched text) or t; = Vj [Vy\V2...Vj = tj-j+A ti-j+2 ■ ■ ■ 1;). is" Animation Lecroq, also [POK, page 27], also see [MAR] for detailed description. Petr Sojka FV030 Textual Information Systems (K)MF Search engine (finite automaton) Construction of the KMF engine i:=l; j:=0; h[l]:=0; while (i0) and (v[i]<>v[j]) do j:=h[j]; i:=i+l; j:=j+l; if (i<=V) and (v[i]=v[j]) then h[i]:=h[j] else h[i] :=j (*MP*) end; Complexity of h computation, e.g. preprocessing, is 0{V), thus in total 0(T + V). Example: h for ababa. KMPvs. MP. Fetr Sojka FV030 Textual Information Systems (K)MF Search engine (finite automaton) Construction of the KMF engine that uses transition table q derived from the searched pattern, (g relates to the transition function 6 of FA): var i,T:integer; found: boolean; text: array[l..T] of char; state,qO: TSTATE; g:array[1..maxstate,1..maxsymb] of TSTATE; F: set of TSTATE;... begin found:= FALSE; state:= qO; i:=0; while (i <= T) and not found do begin i:=i+l; state:= g[state,text [i]]; found:= state in F; end; end; How to transform pattern into g? Fetr Sojka FV030 Textual Information Systems (K)MP Search engine (finite automaton) Construction of the KMP engine -right search is- 5E for left-to-right search A = (Q, T, g, h, q0, F) 9 Q is a finite set of states. • T is a finite input alphabet. • g: Q x T —> QU {fail,} is a forward state-transition function. • h: [Q — qo) —> Qis a backward state-transition function. • qo is an initial state. • F is a set of final states. is" A depth of the state q: d(q) £ No is a length of the shortest forward sequence of the state transitions from qo to q. 
• Characteristics of g and h:
  • g(q0, a) ≠ fail for every a ∈ T (there is no backward transition in the initial state).
  • If h(q) = p, then d(p) < d(q) (the total number of the backward transitions is therefore bounded from above by the number of the forward transitions). So the speed of searching is linear.

• An SE configuration is (q, w), q ∈ Q, w ∈ T* the not-yet-searched part of the text.
• The initial configuration of the SE is (q0, w), where w is the entire searched text.
• An accepting configuration of the SE is (q, w), q ∈ F, w the not-yet-searched text; the found pattern is immediately before w.
• An SE transition is a relation ⊢ ⊆ (Q × T*) × (Q × T*):
  • if g(q, a) = p, then (q, aw) ⊢ (p, w) (a forward transition), for every w ∈ T*;
  • if h(q) = p, then (q, w) ⊢ (p, w) (a backward transition), for every w ∈ T*.
During a forward transition, a single input symbol is read and the engine switches to the next state p. However, if g(q, a) = fail, a backward transition is executed without reading an input symbol. S = O(T) (we measure the number of the SE transitions).

Construction of the KMP SE for the pattern v1 v2 ... vV
1. An initial state q0.
2. g(q, v_{j+1}) = q', where q is equivalent to the prefix v1 v2 ... v_j and q' to the prefix v1 v2 ... v_j v_{j+1}.
3. For q0, we define g(q0, a) = q0 for every a for which g(q0, a) has not been defined in the previous step.
4. g(q, a) = fail for every q and a for which g(q, a) has not been defined in the previous steps.
5. The state that corresponds to the complete pattern is the final one.
6. The backward state-transition function h is defined by the algorithm mentioned below.

Overview of this lecture:
1. Summary of the previous lecture; searching with an SE.
2. Algorithms for the left-to-right search of n patterns (AC, NFA → DFA).
3. Algorithms for the left-to-right search of infinite sets of patterns.
4. Regular expressions (RE).
5. Direct construction of an (N)FA for a given RE.

Part V
Search of a finite set of patterns

SE for the left-to-right search of a set of patterns P = {v^1, v^2, ..., v^P}: instead of a repeated search of the text for every pattern, there is only "one" pass (FA).

var text: array[1..T] of char;
    i: integer; found: boolean;
    state: tstate;
    g: array[1..maxstate, 1..maxsymbol] of tstate;
    h: array[1..maxstate] of tstate;
    F: set of tstate;
found := false; state := q0; i := 0;
while (i <= T) and not found do begin
  i := i + 1;
  while g[state, text[i]] = fail do state := h[state];
  state := g[state, text[i]];
  found := state in F
end

• How do we construct the state-transition functions h and g?
• How does it work for P patterns? What is the main idea?
• Aho, Corasick, 1975 (the AC search engine).

Construction of g for the AC SE for a set of patterns P = {v^1, v^2, ..., v^P}
1. An initial state q0.
2. g(q, b_{j+1}) = q', where q is equivalent to the prefix b1 b2 ... b_j and q' to the prefix b1 b2 ... b_{j+1} of the pattern v^i, for every i ∈ {1, ..., P}.
3. For q0, we define g(q0, a) = q0 for every a for which g(q0, a) has not been defined in the previous steps.
4. g(q, a) = fail for every q and a for which g(q, a) has not been defined in the previous steps.
5. Each state that corresponds to a complete pattern is a final one.
An example: P = {he, she, her} over T = {h, e, r, s, x}, where x stands for anything other than {h, e, r, s}.

Construction of h for the AC SE for a set of patterns P = {v^1, v^2, ..., v^P}
At first, we define the failure function f inductively, relative to the depth of the states, this way:
1. For every state q of the depth 1, f(q) = q0.
2. Let us assume that f is defined for each state of the depth d and lesser. Let q_p denote a state of the depth d with g(q_p, a) = q'. Then we compute f(q') as follows:
   q := f(q_p);
   while g(q, a) = fail do q := f(q);
   f(q') := g(q, a);

• The cycle terminates, since g(q0, a) ≠ fail.
• If the states q and r represent prefixes u and v of some of the patterns from P, then f(q) = r iff v is the longest proper suffix of u.

• We could use f as the backward state-transition function h; however, redundant backward transitions would be performed.
• We therefore define the function h inductively, relative to the depth of the states, this way:
  • For every state q of the depth 1, h(q) = q0.
  • Let us assume that h is defined for each state of the depth d and lesser. Let the depth of q be d + 1. If the set of letters for which the value of g in the state f(q) differs from fail is a subset of the set of letters for which the value of g in the state q differs from fail, then h(q) := h(f(q)); otherwise h(q) := f(q).
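The whole AC construction above (the goto function g as a trie, the failure function f computed by increasing depth, and the completion of g so that no fail transitions remain) can be condensed into a small C sketch; the fixed limits, the global arrays, and the "first match only" interface are our simplifications:

```c
#include <string.h>

enum { MAXS = 64, SIGMA = 256 };        /* small fixed limits (our choice) */
static int g_[MAXS][SIGMA];             /* goto function g (state 0 = q0) */
static int f_[MAXS];                    /* failure function f */
static int out_[MAXS];                  /* 1 iff the state is final */
static int nstates;

/* Steps 1-2: build the trie of all pattern prefixes; then compute f
   breadth-first (by increasing depth) and complete g, so that the
   search never needs explicit backward transitions. */
void ac_build(const char **pats, int np) {
    memset(g_, 0, sizeof g_); memset(f_, 0, sizeof f_);
    memset(out_, 0, sizeof out_); nstates = 1;
    for (int p = 0; p < np; p++) {
        int q = 0;
        for (const char *s = pats[p]; *s; s++) {
            unsigned char a = (unsigned char)*s;
            if (!g_[q][a]) g_[q][a] = nstates++;
            q = g_[q][a];
        }
        out_[q] = 1;                    /* state of a complete pattern */
    }
    int queue[MAXS], head = 0, tail = 0;
    for (int a = 0; a < SIGMA; a++)     /* depth-1 states: f(q) = q0 */
        if (g_[0][a]) queue[tail++] = g_[0][a];
    while (head < tail) {
        int q = queue[head++];
        for (int a = 0; a < SIGMA; a++) {
            int q2 = g_[q][a];
            if (q2) {
                f_[q2] = g_[f_[q]][a];   /* longest proper suffix state */
                out_[q2] |= out_[f_[q2]];
                queue[tail++] = q2;
            } else {
                g_[q][a] = g_[f_[q]][a]; /* complete g: no fail left */
            }
        }
    }
}

/* Return the 0-based end position of the first match, or -1. */
int ac_search(const char *text) {
    int q = 0;
    for (int i = 0; text[i]; i++) {
        q = g_[q][(unsigned char)text[i]];
        if (out_[q]) return i;
    }
    return -1;
}
```

For P = {he, she, her}, scanning "xsherx" stops at the final e of she: the state for she is final, and its failure state he is final as well.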
Finite automata for searching
Definition: A deterministic finite automaton (DFA) is M = (K, T, δ, q0, F):
1. K is a finite set of inner states.
2. T is a finite input alphabet.
3. δ is a mapping from K × T to K.
4. q0 ∈ K is an initial state.
5. F ⊆ K is a set of final states.

• A completely specified automaton: δ is defined for every pair (q, a) ∈ K × T; otherwise it is an incompletely specified automaton.
• A configuration of M is a pair (q, w), where q ∈ K and w ∈ T* is the not-yet-searched part of the text.
• The initial configuration of M is (q0, w), where w is the entire text to be searched.
• An accepting configuration of M is (q, w), where q ∈ F and w ∈ T*.

During a transition, a single input symbol is read and the automaton switches to the next state p.
• A transition of M is defined by a state and an input symbol; it is a relation ⊢ ⊆ (K × T*) × (K × T*); if δ(q, a) = p, then (q, aw) ⊢ (p, w) for every w ∈ T*.
• The k-th power, the transitive closure, and the transitive-reflexive closure of the relation ⊢: ⊢^k, ⊢^+, ⊢^*.
• L(M) = {w ∈ T* : (q0, w) ⊢^* (q, w') for some q ∈ F, w' ∈ T*} is the language accepted by the FA M.
• The time complexity is O(T) (we measure the number of the transitions of the FA M).

Definition: A nondeterministic finite automaton (NFA) is M = (K, T, δ, q0, F) with δ: K × T → 2^K; δ(q, a) is now a set of states.
Definition: a transition ⊢ ⊆ (K × T*) × (K × T*): if p ∈ δ(q, a), then (q, aw) ⊢ (p, w) for every w ∈ T*.
Definition: a final state and L(M) are defined analogously as for a DFA.
Construction of an SE (DFA) from an NFA
Theorem: for every nondeterministic finite automaton M = (K, T, δ, q0, F), we can build a deterministic finite automaton M' = (K', T, δ', q0', F') such that L(M) = L(M').

A constructive proof (the algorithm):
Input: a nondeterministic FA M = (K, T, δ, q0, F).
Output: a deterministic FA M'.
1. K' = {{q0}}; the state {q0} is unmarked.
2. If all the states in K' are marked, continue with step 4.
3. Choose an unmarked state q' from K':
   • δ'(q', a) = ∪ {δ(p, a)} for every p ∈ q' and a ∈ T;
   • K' = K' ∪ δ'(q', a) for every a ∈ T;
   • mark q' and continue with step 2.
4. The final states of M' are those q' ∈ K' that contain some state of F; the initial state of M' is {q0}.

Construction of δ' for an SE for a set of patterns P = {v^1, v^2, ..., v^P}
1. We create an NFA M:
   • An initial state q0.
   • For every a ∈ T, we define q0 ∈ δ(q0, a) (a self-loop in the initial state).
   • For every i ∈ {1, ..., P}, we define δ(q, b_{j+1}) ∋ q', where q is equivalent to the prefix b1 ... b_j and q' to the prefix b1 ... b_{j+1} of the pattern v^i.
   • The state corresponding to the entire pattern is a final one.
2. ... and then we construct its corresponding DFA M' with δ'.

Part VI
Search for an infinite set of patterns

Definition: a regular expression E over the alphabet A:
1. ε and ∅ are RE, and a is a RE for every a ∈ A.
2. If x and y are RE over A, then:
   • (x + y) is a RE (union);
   • (x.y) is a RE (concatenation);
   • (x)* is a RE (iteration).
A convention about the priority of the regular operations: union < concatenation < iteration.
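The core of the subset construction above, δ'(q', a) = ∪ δ(p, a) over p ∈ q', is easy to run with state sets stored as bit masks. The tiny NFA below is our illustrative example (it accepts every text containing ab, with a self-loop in q0), not one from the lecture:

```c
/* NFA state i is represented by bit i of an unsigned value. */
enum { NSYM = 2, NQ = 3 };              /* alphabet {a = 0, b = 1} */
static const unsigned delta_nfa[NQ][NSYM] = {
    /* q0: */ { 0x3, 0x1 },             /* on a: {q0,q1}; on b: {q0} */
    /* q1: */ { 0x0, 0x4 },             /* on b: {q2} */
    /* q2: */ { 0x4, 0x4 },             /* accepting sink */
};

/* One deterministic step on a set of NFA states: the union of delta
   over all members of the set; each reachable set is one DFA state. */
unsigned dfa_step(unsigned set, int sym) {
    unsigned next = 0;
    for (int q = 0; q < NQ; q++)
        if (set & (1u << q))
            next |= delta_nfa[q][sym];
    return next;
}
```

Starting from {q0} (mask 0x1) and reading a, a, b leads to a set containing q2 (bit 0x4), i.e., an accepting DFA state; enumerating all masks reachable this way yields exactly the K' of the algorithm.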
Definition: thereafter, we consider as (generalized) regular expressions even those terms that do not contain, with regard to this convention, the unnecessary parentheses.

The value h(E) of a regular expression E:
• h(∅) = ∅, h(ε) = {ε}, h(a) = {a};
• h(x + y) = h(x) ∪ h(y); h(x.y) = h(x).h(y); h(x*) = (h(x))*;
• h(x*) = {ε} ∪ h(x) ∪ h(x.x) ∪ h(x.x.x) ∪ ...
• The value of a RE is a regular language (RL).
• Every RL can be represented by a RE.
• For every RE V there exists a FA M such that h(V) = L(M).

Axioms of regular expressions:
A1: x + (y + z) = (x + y) + z (associativity of union)
A2: x.(y.z) = (x.y).z (associativity of concatenation)
A3: x + y = y + x (commutativity of union)
A4: (x + y).z = x.z + y.z (right distributivity)
A5: x.(y + z) = x.y + x.z (left distributivity)
A6: x + x = x (idempotence of union)
A7: ε.x = x (identity element for concatenation)
A8: ∅.x = ∅ (zero element for concatenation)
A9: x + ∅ = x (identity element for union)
A10: x* = ε + x*.x
A11: x* = (ε + x)*

Overview of this lecture:
1. Summary of the previous lecture.
2. Regular expressions, the value of a RE, characteristics.
3. Derivation of regular expressions.
4. Direct construction of an equivalent DFA for a given RE by derivation.
5. Derivation of regular expressions by the position vector.
6. Right-to-left search (BMH, CW, BUC).

Theorem: this axiomatization of RE is complete and consistent.
Definition: regular expressions are termed similar if they can be mutually converted using the axioms A1 to A11.
Theorem: similar regular expressions have the same value.
Definition: the length d(E) of the regular expression E:
1. If E consists of one symbol, then d(E) = 1.
2. d(V1 + V2) = d(V1) + d(V2) + 1.
3. d(V1.V2) = d(V1) + d(V2) + 1.
4. d(V*) = d(V) + 1.
5. d((V)) = d(V) + 2.
Note: the length corresponds to the syntax of the regular expression.

Construction of an NFA for a given RE

Definition: a generalized NFA allows ε-transitions (transitions without reading an input symbol).
Theorem: for every RE E we can create an FA M such that h(E) = L(M).
Proof: by structural induction relative to the RE E.
(Figures: the base automata for ε, ∅ and a, and the constructions for union, concatenation and iteration.)

Properties of the constructed automaton M for E:
• No more than two edges come out of any state.
• No edges come out of the final states.
• The number of states of M ≤ 2 · d(E).
• The simulation of the automaton M is performed in O(d(E) · |T|) time and O(d(E)) space, where T is the text.
For the following methods of NFA simulation, we must remove the ε-transitions.
We can achieve it with the well-known procedure (ε-closure construction).

Simulation of an NFA

We represent a set of states with a Boolean vector and pass through all the paths at the same time. There are two approaches:
• a general algorithm that uses a transition table;
• implementation of the automaton in the form of a (generated) program for the particular automaton.

Construction of an NFA for a given RE by numbering of positions: let E be a RE over the alphabet T. We number the occurrences of symbols in E, obtaining E' over the alphabet Z of numbered symbols, and create FA M = (K, T, δ, q0, F):
1. A set of states K = {q0} ∪ Z.
2. A transition function δ: δ(q0, x) contains the x' ∈ Z that originate from the numbering of x and can begin a string of h(E'); δ(y, x) contains the x' ∈ Z originating from x that can follow y in a string of h(E').
3. F = {y ∈ Z : a string of h(E') can end with the symbol y} (plus q0 if ε ∈ h(E)).

Construction of the CW search engine
Input: a set of patterns P = {v1, ..., vk}. Output: the CW search engine.
Method: we construct the function g and introduce the evaluation w of the individual states:
1. An initial state q0; w(q0) = ε.
2. Each state of the search engine corresponds to a suffix bm bm+1 ... bn of a pattern vi of the set P. Let us define g(q, a) = q', where q corresponds to the suffix bm ... bn and q' to the suffix a bm ... bn of a pattern vi; w(q) = bn ... bm+1 bm; w(q') = w(q)a.
3. g(q, a) = fail for every q and a for which g(q, a) was not defined in step 2.
4. Each state that corresponds to a full pattern is a final one.

Definition: shift[STATE, TEXT[i − j]] = min{A, shift2(STATE)}, where A = max{shift1(STATE), char(TEXT[i − j]) − j − 1}. The functions are defined this way:
1. char(a) is defined for all the symbols of the alphabet T as the least depth of a state to which the CW search engine passes through the symbol a. If the symbol a is not in any pattern, then char(a) = LMIN + 1, where LMIN is the length of the shortest pattern. Formally: char(a) = min{LMIN + 1, min{d(q) | w(q) = xa, x ∈ T*}}.
2. Function shift1(q0) = 1; for the other states the value is shift1(q) = min{LMIN, A}, where A = min{k | k = d(q') − d(q), where w(q) is a proper suffix of w(q') and the state q' has a greater depth than q}.
3. Function shift2(q0) = LMIN; for the other states the value is shift2(q) = min{A, B}, where A = min{k | k = d(q') − d(q), where w(q) is a proper suffix of w(q') and q' is a final state}, and B = shift2(q''), where q'' is the predecessor of q.

Example: P = {cacbaa, aba, acb, acbab, ccbab}, LMIN = 3.
char: a → 1, b → 1, c → 2, every other symbol → 4 (= LMIN + 1).
(Table of shift1/shift2 values for the states ε, a, b, aa, ab, bc, ba, aab, aba, bca, bab, aabc, babc, aabca, babca, babcc, aabcac; for example shift1(aa) = 3, shift2(aa) = 2.)

Outline:
• Right-to-left search of one pattern.
• Right-to-left search of an infinite set of patterns.
• Two-way jump automaton - a generalization of the left-to-right and right-to-left algorithms learned so far.
• Hierarchy of the exact search engines.

Part VIII: Search for an infinite set of patterns

Right-to-left search for an infinite set of patterns

Definition: a reversed regular expression is created by reversing all the concatenations in the expression.
Example: the reversed RE for E = bc(a + a*bc) is E^R = (a + cba*)cb.

Right-to-left search for an infinite set of patterns (cont.)
Buczilowski: to search for E, we create E^R and use it for the determination of shift[STATE, SYMBOL] for each state and undefined transition, analogically as in the CW algorithm. (Example shift table for the states 0-5 over the symbols a, b, c and the other symbols X.)

Definition: a two-way deterministic finite automaton with jumps (2DFAS) is M = (Q, T, δ, q0, F), whose transition function prescribes, besides the next state, also the movement of the head: δ(q, a) = (q', −1) realizes a right-to-left comparison, while δ(q, a) = (q', m) with m > 1 realizes a jump to the position t = min{i + m, n + 1}; jumps may be performed both left-to-right and right-to-left. The 2DFAS generalizes the so far learned left-to-right and right-to-left algorithms.

Fuzzy search: metrics

Definition: a metric on a set X is a function d with the properties:
1. d(x, y) ≥ 0 and d(x, x) = 0;
2. d(x, y) = d(y, x) (symmetry);
3. d(x, y) = 0 ⇔ x = y (identity of indiscernibles);
4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
We call the values of the function d distances.

Definition: let us have strings X and Y over the alphabet Σ. The minimal number of editing operations for the transformation of X to Y is
• the Hamming distance (R-distance), when we allow just the operation Replace;
• the Levenshtein distance (DIR-distance), when we allow the operations Delete, Insert and Replace;
• the generalized Levenshtein distance (DIRT-distance), when we allow the operations Delete, Insert, Replace and Transpose; transposition is possible at the neighbouring characters only.
They are metrics; the Hamming distance is defined over strings of the same length only, the Levenshtein distance also over different lengths.
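The R-distance and DIR-distance just defined can be computed directly; the following is a minimal illustrative sketch (not code from the course; names and encoding are our own), using the standard dynamic programming for DIR.

```python
# Sketch of the R-distance (Hamming) and DIR-distance (Levenshtein).

def hamming(x, y):
    """R-distance: Replace only; defined for strings of equal length."""
    if len(x) != len(y):
        raise ValueError("R-distance requires strings of the same length")
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y):
    """DIR-distance: minimal number of Delete/Insert/Replace operations."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]
```

Note that DIR(X, Y) ≤ R(X, Y) for strings of equal length, since every Replace script is also a DIR script.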
Example: find such an example of strings X and Y that R(X, Y) = 5, DIR(X, Y) = 5 and DIRT(X, Y) = 5 hold simultaneously, or prove the non-existence of such strings.
Example: find such an example of strings X and Y that R(X, Y) = 5, DIR(X, Y) = 4 and DIRT(X, Y) = 3 hold simultaneously, or prove the non-existence of such strings.
Example: find such an example of strings X and Y of the length 2n, n ∈ N, that R(X, Y) = 2n and a) DIR(X, Y) = 2; b) DIRT(X, Y) = …

Classification of search problems

Definition: let T = t1 t2 ... tn be a text and P = p1 p2 ... pm a pattern. For example, we can ask:
1. Is P a substring of T?
2. Is P a subsequence of T?
3. Is a substring or a subsequence P in T?
4. Is P in T such that D(P, X) ≤ k for k < m, where X = ti ... tj is a part of T (D is R, DIR or DIRT)?
5. Is a string P containing the don't-care symbol ∗ in T?
6. Is a sequence of patterns P in T?
Furthermore, there are the variants for more patterns, plus the instances of the search problem: yes/no, the first occurrence, all the overlapping occurrences, all the non-overlapping occurrences.

6D classification of search problems

Dimension: 1    2    3    4    5    6
           S    F    O    E    C    O
           Q    S    F    R    D    S
                     I    D
                          G

In total 2 × 2 × 3 × 4 × 2 × 2 = 192 search problems classified in a six-dimensional space. For example, SFO??? denotes all the SE for the search of one (entire) string. For all these problems, we are going to learn how to create an NFA for searching.

Example: let P = p1 p2 p3 ... pm, m = 4; A denotes any character of Σ.
(Figure: NFA for SFOECO, with a self-loop under A in the initial state.)

Search for a sequence of characters
Example: NFA for QFOECO (seQuence): p̄ denotes any character of Σ except p. The automaton has m + 1 states for a pattern of the length m.

Example: the NFA for QSOECO is similar; we just add cycles for non-matching characters and ε-transitions to all the existing forward transitions (or we concatenate the automaton m times).
Definition: the automaton for QSOECO is called an ε-trellis.

Definition: the automaton for searching with R-distance at most k is called an R-trellis and has (m + 1) + m + (m − 1) + ... + (m − k + 1) = (k + 1)(m + 1 − k/2) states. The number of the level of the final state corresponds to the length of the found string from the pattern.

SFOGCO
For the DIRT-distance, we add to the SFODCO automaton more new states that correspond to the operation of transposition, and also the corresponding pair of edges for every transposition.

The animation program by Mr. Pojer for the discussed search automata is available for download from the course web page and is also installed.
Simulation of the NFA, or determinisation? A hybrid approach.
The Prague Stringology Club and its conference series: see http://www.stringology.org/.

Outline:
• Searching with text preprocessing; indexing methods.
• Methods of indexing.
• Automatic indexing, thesaurus construction.
• Ways of index implementation.

Part X: Indexing Methods

Large amounts of texts? Text preprocessing!
• Index, indexing methods, index file, index-sequential file.
• Hierarchical structuring of text, tagging of text, hypertext.
• Questions of word list (lexicon) storage and occurrence (hit) list storage, and of their updating.

Granularity of the index items: document - paragraph - sentence - word.

        word1 word2 word3 word4
doc1      1     1     0     1
doc2      0     1     1     1
doc3      1     0     1     1

Inverted file (transposition):

        doc1  doc2  doc3
word1     1     0     1
word2     1     1     0
word3     0     1     1
word4     1     1     1

• Word order (primary key) in the index → binary search. Time complexity of one word search in the index: O(V × log2(n)), where n is the index length and V the pattern length; a query of k words costs k such searches.
• The rule 20-30: 20% of the most frequent words make up 30% of the text [MEL, fig. 4.19].

Automatic indexing
The method is based on deriving word significance from word frequencies (cf. the Collins Cobuild dictionary); words with low and high frequency are cut out.
Input: n documents. Output: a list of words suitable for index creation.
1. We calculate the frequency FREQ_ik for every document i ∈ {1, ..., n} and every word k ∈ {1, ..., K} (K is the count of different words in all the documents).
2. We calculate TOTFREQ_k = Σ_{i=1}^{n} FREQ_ik.
3. We create a frequency dictionary for the words k ∈ {1, ..., K}.
4. We set a threshold for the exclusion of very frequent words.
5. We set a threshold for the exclusion of words with a low frequency.
6. We insert the remaining words into the index.
Questions of threshold determination: [MEL, fig. 4.20].

Outline:
• Excursus to computational linguistics.
• Corpus linguistics as a TIS example.
• Search methods with preprocessing of text and pattern (query).

Morphology utilization for dictionary creation
• stem/root of words (učit, uč);
• the program ajka (abin), http://nlp.fi.muni.cz/projekty/ajka/ examples;
• a technique of patterns for stem determination.

Thesaurus
• Thesaurus - a dictionary containing hierarchical and associative relations and relations of equivalence between particular terms.
• Relations between terms/lemmas:
 • synonyms - relation to a standard term, e.g. "see";
 • relation to a related term (RT), e.g. "see also";
 • relation to a broader term (BT);
 • relation to a narrower term (NT);
 • hypernyms (car:means of transport); hyponyms (bird:jay); meronyms (door:lock); holonyms (hand:body); antonyms (good:bad).
• Dog/Fík, Havel/president.

Thesaurus construction: manually / half-automatically
• heuristics of thesaurus construction:
 • hierarchical structure(s) of the thesaurus;
 • field thesauri - the semantics is context-dependent (e.g. field, tree in informatics);
 • compounding of terms with a similar frequency;
 • exclusion of terms with a high frequency;
• breadth of application of a thesaurus and lemmatizer: besides spell checking and indexing, the base of a grammar checker and of fulltext search;
• projects WORDNET, EUROWORDNET;
• module add wordnet; wn faculty -over -simsn -coorn

• Knowledge base creation for exact evaluation of document relevance.
• Hot topic: processing of semantic maps of terms, e.g. Visual Thesaurus http://www.visualthesaurus.com.
• Tovek Tools, Verity.

Part XI: Excursus to the Computational Linguistics

Levels of language analysis in TIS:
• string searching - words are strings of letters;
• word-forming - morphological analysis;
• grammar (CFG, DCG) - syntactic analysis;
• meaning of sentences (TIL) - semantic analysis;
• context - pragmatic analysis;
• full understanding and communication ability - information.

Corpus query examples

Basic queries:
• „Havel";
  45: Český prezident Václav se včera na
  09: jak řekl Václav , každý občan
  248: více než rokem řekl Pravda vítězí

Regular expressions:
• „Pravda|pravda";
• „(P|p)ravda";
• „(P|p)ravd[a,u,o,y]";
• „pravd.*"; „pravd.+"; „post?el";

Word sequences:
• „prezident(a|u)" „Havl(a|ovi)";
• „a tak";
• „prezident"; []* „Havel";
• „prezident" („republiky" „Václav")? „Havel";

Queries for positional attributes:
• [word = „Havel"];
• [lemma = „prezident"] []* [lemma = „Havel"];
• ... ženu prezidenta Havla ... [lemma = „hnát"] [] [lemma = „Havel"];
• [word = „žen(u|eme)" & lemma != „žena"]; (| ... or, ! ... not)

Some other possibilities:
• [lemma = „prezident"] []* [lemma = „Havel"] within 6;
• [lemma = „Havel"] within 20 „Pravda";
• a:[word = „Žena|Muž|Člověk"] []* [lemma = a.lemma]

Motivation
The large computational power of today's computers enables:
• efficient storing of large amounts of text data (compression, indexing);
• efficient search for text strings.
A man sitting behind a computer uses all this to obtain, from the documents processed this way, the information he is interested in. Really?
Example: a text database stores a few last years of a daily newspaper. I'd like to obtain information about president Václav Havel.
a/ >HAVEL
b/ more precise queries
Computer (computational power) + human (intelligence) + time + money → valuable information. The goal of everybody is to transfer the largest possible part of the intelligence (time, money) to the computer.
(Table: levels of language analysis vs. what TIS can do today)
string searching - words are strings of letters - yes - simple check, simple translation
morphological analysis - word-forming, lemma - yes - spell check, partial translation
syntactic analysis - grammar (CFG, DCG) - partially - grammar check
semantic analysis - sentence meaning (TIL) - starting up - correct translation
pragmatic analysis - context - no - searching for information
information - the ideal of ideals - no

Information from natural language
Do we really know what information is contained in a text in natural language?
• František Novák weighs 100 kg - object, property, value.
• František Novák likes beer. František Novák likes Jana Novotná - relations.
• F. N. is an old honest man. Spring erupted in full force.
Words of the natural language denote objects, their properties and relations between them. It is possible to see words and sentences also as "functions" of a kind, defined by their meaning.
• A man who climbed the highest Czech eight-thousander is my grandson. → RDB attribute1, attribute2, ...? key value?

Corpora
• Corpus: an electronic collection of texts, often indexed by linguistic tags.
• Corpus as a text information system: corpus linguistics.
• BNC, Penn Treebank, DESAM, PNK, ...; ranges from millions to billions of positions (words); special methods are necessary.
• Corpus managers CQP, GCQP, Manatee/Bonito, http://www.fi.muni.cz/~pary/ see [MAR].

Definition: a corpus is a large, internally structured compact file of texts in natural language, electronically stored and processable.
• Indian languages have no script - for finding a grammar it is necessary to write up the spoken word.
• 1967 - the first corpus in the U.S.A. (Kučera, Francis), 1,000,000 words.
• Noam Chomsky - refuses corpora.
• Today - massive expansion.
• The WWW page of Pavel Rychlý (~pary) links to basic information. Bonito, Manatee.
• IMS CORPUS WORKBENCH - a toolkit for efficient representation and querying over large text files.

Corpus tagging
A corpus is a sequence of words at numbered positions (first word, nth word), to which tags are added (the addition of tags is called corpus tagging). Tags are morphological, grammatical and any other information about a given word. This leads to the more general concept of position attributes; those are the most important tagging type. Attributes of this class have a value (string) at every corpus position. The basic positional attribute, linking one word to every position, is word. In addition to this attribute, further position attributes may be bundled with each position of any text, representing the morphological and other tags.
Structural attributes: sentences, paragraphs, title, article, SGML.

Example (positions 0-4):
WORD:  Českého prezidenta Václava Havla dnes
LEMMA: český prezident václav havel dnes

Internal representation of position attributes: the two key terms are
• Uniform representation: the items of all attributes are encoded as integer numbers, where the same values have the same code. A sequence of items is then represented as a sequence of integers. The internal representation of the attribute word (as well as of any other positional attribute) is array(0..p−1) of integer, where p is the position count of the corpus.
• Inverted file: for the sequence of numbers representing the sequence of values of a given attribute, an inverted file is created. This file contains, for every value (more precisely, value code), the set of its occurrences in the position attribute.
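A minimal sketch of the uniform representation and the inverted file described above; the function and variable names are invented for illustration (a real corpus manager stores these structures in packed binary files).

```python
# Toy sketch: lexicon (item -> integer code), corpus as a sequence of
# codes, and an inverted file mapping each code to its positions.

def encode_corpus(words):
    lexicon = {}                    # item -> code; the first item gets 0
    codes = []
    for w in words:
        if w not in lexicon:
            lexicon[w] = len(lexicon)
        codes.append(lexicon[w])
    inverted = {}                   # code -> sorted list of corpus positions
    for pos, c in enumerate(codes):
        inverted.setdefault(c, []).append(pos)
    return lexicon, codes, inverted
```

The frequency of an item is then simply the size of its occurrence set, len(inverted[code]), obtained "in one step".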
The inverted file is needed for searching, because it directly gives the set of occurrences of a given item; the occurrences can then be counted in one step.

Auxiliary files
The file with encoded attribute values and the inverted file both have auxiliary files.
• The first data structure is a list of items, or "lexicon": it contains the set of different values. Internally it is a set of strings occurring in the sequence of items, where the symbol Null (octal 000) is inserted behind every word. The list of items already defines a code for every item, because we suppose the first item in the list to have the code 0, the following 1, etc.

There are three data structures for the inverted file:
• The first is the inverted file itself, which contains the sets of corpus positions.
• The second is an index of this file. This index returns, for every item code, the entry point of the corresponding occurrence set in the inverted file.
• The third is a table of item code frequencies, which gives for each item code the number of its occurrences in the corpus (which is of course the same as the size of its occurrence set).

Preprocessing
Preprocessing of text and pattern (query) is used by the overwhelming majority of today's TIS. Types of preprocessing:
• n-gram statistics (fragment indexes);
• special algorithms for index processing (coding, compression) and relevance evaluation (PageRank, Google);
• usage of natural language processing methods (morphology, syntactic analysis, semantic databases) and aggregation of information from multiple sources (the systems AnswerBus, START);
• signature methods.

Relevance
Definition: relevance (of answers to a query) is the rate by which a selected document coincides with the requirements imposed on it.
Ideal answer = real answer.
Definition: the coefficient of completeness (recall) R = m/n, where m is the count of selected relevant records and n is the count of all relevant records in the TIS.
Definition: the coefficient of precision P = m/o, where o is the count of all records selected by the query.
We want to achieve maximum P and R: a tradeoff. Standard values: 80% for P, 20% for R. Combination of completeness and precision: the coefficient F_β = (β² + 1)PR / (β²P + R) (F_0 = P, F_∞ = R; F_1 means P and R weighted equally).

Fragment indexes
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: a fixed dictionary, no problems with updates.
• Disadvantages: dependency on the language and thematic area, decreased precision of search.

Google
• Google as an example of a web-scale information system.
• Jeff Dean's video - historical notes on Google search developments.
• Google - system architecture.
• Google - PageRank.
• Google File System.
• Implementation of index systems.

Goooooooooooooogle - a bit of history
An example of the anatomy of a global (hyper)text information system (www.google.com).
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are known - therefore a more detailed analysis according to the article [GOO] http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.

• Several innovative concepts: PageRank, storing of a local compressed archive, calculation of relevance from the texts of hypertext links, indexing of PDF and other formats, Google File System, Google Link...
• The system anatomy: see [MAR].

Relevance computation
The crucial thing is the computation of documents' relevance (credit).
• Usage of text tags and web typography for the relevance calculation of document terms.
• Usage of the text of hyperlinks referring to the document.

PageRank
• PageRank: an objective measure of page importance based on citation analysis (suitable for the ordering of answers to queries, namely page relevance computation).
• Let pages T1, ..., Tn (citations) point to a page A, let d be a damping factor and C(T) the count of links going out of a page T. Then

  PR(A) = (1 − d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

• PageRank can be calculated by a simple iterative algorithm (for tens of millions of pages in hours on a normal PC).
• PageRank is a probability distribution over web pages.
• PageRank is not the only applied factor, but one coefficient of several factors. Motivation with a random surfer; the damping factor d is usually around 0.85.

Implementation of index systems
• Storing of file signatures.
• Storing of the lexicon.
• Storing of the hit list.
• Google File System.

• Inverted file - an indexing file with a bit vector.
• Usage of a document list for every key word.
• Coordinate system with pointers [MEL, fig. 4.12, page 46].
• Indexing of corpus texts: Finlib http://www.fi.muni.cz/~pary/dis.pdf, see [MAR].
• Use of Elias coding for the compression of the hit list.
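The iterative PageRank computation described above can be sketched as follows. This is an illustrative toy version, not Google's implementation; the graph encoding (page → list of outgoing links) and the names are our own assumptions, with d = 0.85.

```python
# Toy PageRank iteration: PR(A) = (1 - d) + d * sum PR(T)/C(T)
# over the pages T that link to A.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial ranks
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum PR(T)/C(T) over pages T pointing to p
            s = sum(pr[t] / len(links[t]) for t in pages if p in links[t])
            new[p] = (1 - d) + d * s
        pr = new
    return pr
```

For example, in a three-page graph where both b and c link to a and only a links back to b, the iteration ranks a highest and the unlinked page c lowest.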
is" Commercial implementations: Verity engine, most of web search engines - with few exceptions - hide their key to success. Petr Sojka PV030 Textual Information Systems Index System Implementation Article M. Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing see [MAP] is" Dictionary representation by finite automaton. is" Ambiguities, unification of minimized deterministic automata. is" Example: done,do.V3:PP done,done.AO is" Morphological dictionary as a list of pairs [word form, lemma]. is" Compaction of storing of data structure of automata (Liang, -1983). is" Compression ratio up to -1:20 in the linear approach (given the length of word). Petr Sojka PV030 Textual Information Systems Index System Implementation is" Transducer for dictionary representation. is" Deterministic transducer with -1 output (subsequential transducer) for dictionary representation including one string on output [Information about morphology, hyphenation,...). is" Deterministic transducer with p outputs (p—subsequential transducer) for dictionary representation including more strings on output (ambiguities). is" Determinization of the transducer generally unrealizable (the class of deterministic transducers with an output is a proper subclass of nondeterministic transducers); for purposes of natural language processing, though, usually doesn't occur (there aren't cycles). Petr Sojka PV030 Textual Information Systems Index System Implementation is" An addition of a state to a transducer corresponding (wi.w^) without breaking the deterministic property: first a state for [w-],e), then with resulting state final state with output W2- is" Efficient method, quick, however not minimal; there are minimizing algorithms, that lead to spatially economical solutions. is" Procedure: splitting of dictionary, creation of det. transducers with p outputs, their minimization, then a deterministic unification of transducers and minimizing the resulting. 
• An addition of a state corresponding to (w1, w2) to a transducer without breaking the deterministic property: first a state for (w1, ε), then the resulting final state with the output w2.
• An efficient and quick method, however not a minimal one; there are minimization algorithms that lead to spatially economical solutions.
• Procedure: splitting of the dictionary, creation of deterministic transducers with p outputs, their minimization, then a deterministic unification of the transducers and minimization of the result.
• Other uses: efficient indexing, speech recognition, etc.

Part XII: Coding

• Coding.
• Entropy, redundancy.
• Universal coding of the integers.
• Huffman coding. Adaptive Huffman coding.

Basic notions
Definition: an alphabet A is a finite nonempty set of symbols.
Definition: a word (string, message) over A is a sequence of symbols from A.
Definition: the empty string ε is the empty sequence of symbols. The set of nonempty words over A is labeled A+.
Definition: a code K is a triple (S, C, f), where S is a finite set of source units, C is a finite set of code units, and f: S → C+ is an injective mapping. f can be expanded to f: S+ → C+: f(s1 s2 ... sk) = f(s1) f(s2) ... f(sk). The image of f in C+ is sometimes itself called the code.

Definition: x ∈ C+ is uniquely decodable with regard to f if there is at most one sequence y ∈ S+ such that f(y) = x.
Definition: a code K = (S, C, f) is uniquely decodable if all the strings in C+ are uniquely decodable.
Definition: a code is called a prefix one if no code word is a prefix of another.
Definition: a code is called a suffix one if no code word is a suffix of another.
Definition: a code is called an affix one if it is a prefix and a suffix code.
Definition: a code is called a full one if, after adding any additional code word, a code arises that is not uniquely decodable.

Definition: a block code of length n is a code in which all code words have the length n.
Example: block ⇒ prefix, but not vice versa.
Definition: a code K = (S, C, f) is called binary if |C| = 2.

Definition: compression (coding) and decompression (decoding):
original data → compression (encoding) → compressed data → decompression (decoding) → original data
Definition: the compression ratio is the ratio of the length of the compressed data to the length of the original data.
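The prefix property defined above is easy to test mechanically. The following is an illustrative sketch (our own, not from the slides); it relies on the fact that if one code word is a prefix of another, the two end up adjacent after sorting.

```python
# Check the prefix property: no code word is a prefix of another.

def is_prefix_code(words):
    ws = sorted(words)            # a prefix always sorts immediately before
                                  # some word it is a prefix of
    return all(not ws[i + 1].startswith(ws[i]) for i in range(len(ws) - 1))
```

A block code passes this check trivially, which illustrates "block ⇒ prefix".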
Example: suggest a binary prefix code for decimal digits, if the digits 3 and 4 occur often and 5 and 6 rarely.

Entropy
Let Y be a random variable with a probability distribution p(y) = P(Y = y). Then the mathematical expectation (mean rate) is E(Y) = Σ_y y p(y).
Let S = {x1, x2, ..., xn} be a set of source units and let the occurrence probability of the unit xi in the information source S be pi for i = 1, ..., n, n ∈ N.
Definition: the entropy of the information content of a unit xi (the measure of the amount of information, or uncertainty) is H(xi) = Hi = −log2 pi bits. A source unit with a higher probability bears less information.

Definition: the entropy of an information source S is H(S) = −Σ pi log2 pi bits. It holds that H(S) = Σ p(y) log2 (1/p(y)) = E(log2 (1/p(Y))).
Definition: the entropy of a source message X = x_{i1} x_{i2} ... x_{ik} ∈ S+ of the information source S is H(X, S) = H(X) = Σ_{j=1}^{k} H_{ij} = −Σ_{j=1}^{k} log2 p_{ij} bits.
Definition: the length l(X) of an encoded message X is l(X) = Σ_{j=1}^{k} |f(x_{ij})| = Σ_{j=1}^{k} d_{ij} bits.
Theorem: l(X) ≥ H(X, S).

Axiomatic introduction of entropy: see [MAR]; for details of the derivation see ftp://www.math.muni.cz/pub/math/people/Paseka/lectures/kodovani.ps
Definition: R(K, X) = l(X) − H(X) = Σ_j (d_{ij} + log2 p_{ij}) is the redundancy of a code K for a message X.
Definition: the average length of a code word of K is AL(K) = Σ_{i=1}^{n} pi di bits.
Definition: the average entropy of a source S is AE(S) = Σ_{i=1}^{n} pi Hi = −Σ_{i=1}^{n} pi log2 pi bits.
Definition: the average redundancy of a code K is AR(K) = AL(K) − AE(S) = Σ_{i=1}^{n} pi (di + log2 pi) bits.

Definition: a code is an optimal one if it has minimal redundancy.
Definition: a code is asymptotically optimal if, for a given distribution of probabilities, the ratio AL(K)/AE(S) is close to 1 while the entropy is close to ∞.
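The quantities AE(S), AL(K) and AR(K) defined above can be sketched directly; the probabilities and code word lengths below are invented examples.

```python
# Sketch: average entropy, average code word length, average redundancy
# for a source with probabilities p_i and code word lengths d_i.

import math

def avg_entropy(p):
    """AE(S) = -sum p_i log2 p_i (bits)."""
    return -sum(pi * math.log2(pi) for pi in p)

def avg_length(p, d):
    """AL(K) = sum p_i d_i (bits)."""
    return sum(pi * di for pi, di in zip(p, d))

def avg_redundancy(p, d):
    """AR(K) = AL(K) - AE(S) = sum p_i (d_i + log2 p_i) (bits)."""
    return avg_length(p, d) - avg_entropy(p)
```

For p = (1/2, 1/4, 1/4) and lengths (1, 2, 2), both AE and AL equal 1.5 bits, so the redundancy is 0: the code is optimal for this source.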
Definition: A code K is universal if there are c1, c2 ∈ R such that the average code-word length satisfies AL(K) ≤ c1 · AE(S) + c2.
Theorem: A universal code is asymptotically optimal if c1 = 1.

Definition: The Fibonacci sequence of order m: F_n = F_{n−m} + F_{n−m+1} + ... + F_{n−1} for n ≥ 1.
Example: F of order 2: F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 3, F_4 = 5, F_5 = 8, ...
Example: F of order 3: F_{−2} = 0, F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 7, F_5 = 13, ...
Example: F of order 4: F_{−3} = 0, F_{−2} = 0, F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 8, F_5 = 15, ...
Definition: The Fibonacci representation is R(N) = Σ_{i=1}^k d_i F_i, where d_i ∈ {0, 1} and d_k = 1.
Theorem: The Fibonacci representation is ambiguous; however, there is one with at most m − 1 consecutive ones in the sequence d_i.
Definition: The Fibonacci code of order m is FK_m(N) = d_1 d_2 ... d_k 1...1 (m − 1 ones appended), where d_i are the coefficients from the previous theorem (the ones terminate a code word).
Example: R(32) = 0·1 + 0·2 + 1·3 + 0·5 + 1·8 + 0·13 + 1·21, thus FK_2(32) = 00101011.
Theorem: FK(2) is a prefix, universal code with c1 = 2, c2 = 3; thus it is not asymptotically optimal.

Elias codes of the integers:
• unary code α(N) = 00...01 (N − 1 zeros).
• binary code β(1) = 1, β(2N + j) = β(N)j, j = 0, 1.
• β is not uniquely decodable (it is not a prefix code).
• ternary τ(N) = β(N)#.
• β′(1) = ε, β′(2N) = β′(N)0, β′(2N + 1) = β′(N)1; τ′(N) = β′(N)#.
• γ: every bit of β′(N) is inserted between a pair of bits of α(|β(N)|).
• Example: γ(6) = 01001.
• C_γ = {γ(N) : N > 0} = (0{0,1})*1 is regular and therefore decodable by a finite automaton.
• γ′(N) = α(|β(N)|)β′(N): the same length (a bit permutation of γ(N)), but more readable.
• C_γ′ = {γ′(N) : N > 0} = {0^k 1 {0,1}^k : k ≥ 0} is not regular, and the decoder needs a counter.
• δ(N) = γ(|β(N)|)β′(N).
• Example: δ(4) = γ(3)00 = 01100.
• decoder δ″: while N > 1 do begin K := β(N)K; N := ⌊log2(N)⌋ end.
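The γ′(N) = α(|β(N)|)β′(N) variant above is the usual form of the Elias gamma code; since α(k) = 0^{k−1}1 and β(N) = 1β′(N), the code word is simply len(β(N))−1 zeros followed by β(N). A sketch of encoder and decoder (assuming N ≥ 1; names are mine):

```python
def gamma_encode(n):
    """Elias gamma: alpha(|beta(n)|) followed by beta'(n)."""
    b = bin(n)[2:]                 # beta(n)
    return "0" * (len(b) - 1) + b  # unary length prefix, then beta(n)

def gamma_decode(bits):
    """Decode a concatenation of gamma code words back to a list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":      # count the unary prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read z + 1 bits of beta(n)
        i += z + 1
    return out

print(gamma_encode(6))                                   # 00110
print(gamma_decode(gamma_encode(1) + gamma_encode(6)))   # [1, 6]
```

The counter z in the decoder is exactly the counter the slide mentions: the language of γ′ code words is not regular, so a finite automaton alone cannot decode it.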
• Information encoding for communication purposes.
• Despite the tumultuous evolution of storage capacities, there is still a lack of space, and access to compressed data saves time.
• Redundancy → construction of a code with minimal redundancy.
• Data model:
  • structure - the set of units to compress + the context of their occurrences;
  • parameters - occurrence probabilities of particular units;
  • data model creation;
  • the actual encoding.

• 1838 Morse: the code for 'e' chosen by frequency.
• 1949 Shannon, Fano, Weaver.
• 1952 Huffman; 5 bits per character.
• 1979 Ziv-Lempel; compress (Rodeh, Welch, Bell, Knuth, Miller, Wegman, Fiala, Greene, ...); 4 bits per character.

• run-length encoding (RLE): SC X C - a run of a repeating character X of length C is replaced by a special character SC, the character X, and the count C; generalization to any repeating character: ****** → SC*6.
• MNP Class 5 RLE - C XXX: DDDDDBBAAAA → 5DDDBB4AAA.
• half-byte packing (EBCDIC, ASCII), SI, SO.
• diatomic encoding: replacement of character pairs with one character.
• Byte Pair Encoding, BPE (Gage, 1994).
• pattern substitution.
• Gilbert Held: Data & Image Compression.

• Shannon-Fano coding, 1949, model of order 0.
• code words of length ⌈−log2 p_i⌉ or ⌊−log2 p_i + 1⌋.
• AE ≤ AL < AE + 1.
• code tree for (2, 2, 2, 2, 4, 4, 8).
• generally it is not optimal; two passes of the encoder through the text; static.

Input: a sequence of n source units S[i], 1 ≤ i ≤ n, in order of nondecreasing probabilities.
Output: n binary code words.
begin
  assign the empty string to all code words;
  SF-SPLIT(S)
end

procedure SF-SPLIT(S);
begin
  if |S| ≥ 2 then
  begin
    divide S into sequences S1 and S2 so that both sequences have roughly the same total probability;
    append 0 to all code words of S1;
    append 1 to all code words of S2;
    SF-SPLIT(S1); SF-SPLIT(S2)
  end
end

• Huffman coding, 1952.
• static and dynamic variants.
• an optimal code (not the only possible one).
• O(n), assuming the source units are already sorted.
• stable distribution → the code can be prepared in advance.
• Example: (2, 2, 2, 2, 4, 4, 8).

Definition: A binary tree has the sibling property if and only if
(1) each node except the root has a sibling,
(2) the nodes can be arranged in order of nondecreasing weight so that each node (except the root) is adjacent in this list to its sibling (the left sons are on the odd positions of the list and the right ones on the even positions).

Huffman trees
Theorem: A binary prefix code is a Huffman code if and only if it has the sibling property.
• 2n − 1 nodes, max. 2n − 1 possibilities.
• there are optimal binary prefix codes that are not Huffman codes.
• AR(K) ≤ p_n + 0.086, where p_n is the maximum probability of a source unit.
• A Huffman code is full (hence poor error detection).
• It can be extended to an affix code; KWIC, left and right context, searching for *X*.

Adaptive Huffman coding
• FGK (Faller, Gallager, Knuth).
• suppression of the past by a coefficient of forgetting; rounding.
• linear time of coding and decoding with respect to the word length.
• AL_HD ≤ 2 AL_HS.
• Vitter: AL_HD ≤ AL_HS + 1.
• implementation details: tree representation, code tables.
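A minimal static Huffman construction over the example frequencies (2, 2, 2, 2, 4, 4, 8) can be sketched with a heap (O(n log n) here; O(n) holds only when the units come presorted, as the slide notes). Names are mine:

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: codeword} for a dict of symbol frequencies."""
    tick = count()  # tie-breaker so the heap never compares subtrees
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # two lightest subtrees
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (a, b)))
    codes = {}
    def walk(node, prefix):
        # leaves receive their accumulated 0/1 path as the code word
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = dict(zip("abcdefg", [2, 2, 2, 2, 4, 4, 8]))
codes = huffman(freqs)
cost = sum(freqs[s] * len(c) for s, c in codes.items())
print(cost)   # 64 bits for the whole message of 24 units, i.e. AL = 64/24
```

Several distinct optimal trees exist for these frequencies (the code is "not the only possible one"), but every one of them yields the same total cost of 64 bits.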
Arithmetic coding
• a generalization of Huffman coding (the probabilities of source units need not be negative powers of two).
• ordering of source units; cumulative probabilities cp_i = Σ_{j=1}^{i−1} p_j for source units x_i with probabilities p_i.
• Advantages:
  • arbitrary proximity to the entropy;
  • adaptivity is possible;
  • speed.

Dictionary methods of compression
Definition: A dictionary is a pair D = (M, C), where M is a finite set of words of the source language and C is a mapping of M to the set of code words.
Definition: L(m) denotes the length of the code word C(m) in bits, for m ∈ M.
Selection of source units:
• static (agreement on the dictionary in advance),
• semiadaptive (two passes through the text are necessary),
• adaptive.

Static dictionary methods
Source units of length n: n-grams; most often bigrams (n = 2).
• n fixed, or n variable (by frequency of occurrence),
• adaptive (about 50% of an English text consists of roughly 150 most frequent words).
Disadvantages:
• they are unable to react to the probability distribution of the compressed data,
• a pre-prepared dictionary is needed.

Semiadaptive dictionary methods
Transmitted: dictionary + compressed data, or compressed dictionary + compressed data.
Advantage: extensive data (the dictionary is a small part of the data; corpora, CQP).

Semiadaptive dictionary methods - dictionary creation procedure
1. The frequencies of N-grams are determined for N = 1, 2, ....
2. The dictionary is initialized by inserting the unigrams.
3. N-grams with the highest frequency are gradually added to the dictionary. During the insertion of a K-gram, the frequencies of its component (K−1)-grams, (K−2)-grams, ... are decreased. If the frequency of a component is thereby greatly reduced, it is excluded from the dictionary.
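Step 1 of the dictionary-creation procedure, counting the N-gram frequencies, can be sketched as follows (the function name is mine):

```python
from collections import Counter

def ngram_frequencies(text, max_n):
    """Count all N-grams of the text for N = 1 .. max_n."""
    freq = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            freq[text[i:i + n]] += 1
    return freq

freq = ngram_frequencies("abababa", 2)
print(freq["a"], freq["b"], freq["ab"], freq["ba"])  # 4 3 3 3
```

Steps 2 and 3 would then insert the unigrams and repeatedly pop the most frequent longer N-gram from `freq`, decreasing the counts of its components.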
Next topics:
• Adaptive dictionary methods with dictionary restructuring.
• Syntactic methods.
• Checking of text correctness.
• Querying and TIS models.
• Vector model of documents.
• Automatic text structuring.
• Document similarity.

Adaptive dictionary methods: LZ77 - sliding-window methods; LZ78 - methods with an increasing dictionary.

LZ77
a b c b a b b a | a b a c b
encoded part (window, N ≤ 8192) | not encoded part (|B| ~ 10-20 b)
In the encoded part the longest prefix P of the not encoded part is searched for. If such a string S is found, then P is encoded as (I, J, A), where I is the distance of the first character of S from the border, J is the length of the string S, and A is the first character after the prefix P. The window is then shifted J + 1 characters to the right. If no such substring is found, the triple (0, 0, A) is output, where A is the first character of the not encoded part.

|M| = (N − B) × B × t, where t is the size of the alphabet; L(m) = ⌈log2(N − B)⌉ + ⌈log2 B⌉ + ⌈log2 t⌉.
Advantage: the search for the longest prefix [KMP].

LZR
• uses a tree containing all prefixes of the already encoded part.
• The whole already encoded part is used as the dictionary.
• Because I in (i, j, a) can be large, the Elias code for the integers is used.
Disadvantage: the tree grows without limitation ⇒ after a defined memory size is exceeded, it is deleted and the construction starts from scratch.

LZSS
The code is a sequence of pointers and characters. A pointer (i, j) needs the memory of p characters, so a pointer is used only when it pays off; one extra bit is needed to distinguish a character from a pointer. The number of dictionary items is |M| = t + (N − B) × (B − p) (considering only substrings longer than p).
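The LZ77 triple scheme described above can be sketched as follows (window bookkeeping simplified: the whole already-encoded prefix up to `window` characters back serves as the dictionary; names are mine):

```python
def lz77_encode(text, window=8192):
    """Encode text as triples (i, j, a): distance back, match length, next char."""
    out, pos = [], 0
    while pos < len(text):
        best_i, best_j = 0, 0
        start = max(0, pos - window)
        for i in range(start, pos):             # candidate match starts
            j = 0
            while pos + j < len(text) - 1 and text[i + j] == text[pos + j]:
                j += 1                          # matches may overlap the border
            if j > best_j:
                best_i, best_j = pos - i, j
        out.append((best_i, best_j, text[pos + best_j]))
        pos += best_j + 1                       # shift the window by J + 1
    return out

def lz77_decode(triples):
    s = ""
    for i, j, a in triples:
        for _ in range(j):
            s += s[-i]                          # copy from i characters back
        s += a
    return s

msg = "abcbabbaabacb"
print(lz77_decode(lz77_encode(msg)) == msg)     # True
```

Copying character by character in the decoder is what makes overlapping matches (a match that runs past the border into itself) work, as in a run of a single repeated character.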
The number of bits needed for encoding is
• L(m) = 1 + ⌈log2 t⌉ for m ∈ T,
• L(m) = 1 + ⌈log2 N⌉ + ⌈log2(B − p)⌉ otherwise.
(The length d of a substring can be represented as d − p.)

LZB
A pointer (i, j) (analogy to LZSS). If
• the window is not full (at the beginning) and
• the compressed text is shorter than N,
then using log2 N bits for encoding i is wasteful. LZB uses phased-in binary coding - a prefix code with an increasing number of bits for increasing values; Elias code γ.

A further variant: LZSS where the pointers are encoded by Huffman coding (i.e., according to the distribution of their probabilities ⇒ two passes).

LZ78
The main idea: the dictionary contains phrases. A new phrase is created by extending an already existing phrase by one symbol. A phrase is encoded by the index of its prefix and the added symbol.

Input:  a     b     ab    c     ba    bab   aa    aaa   aaaa
Index:  1     2     3     4     5     6     7     8     9
Output: (0,a) (0,b) (1,b) (0,c) (2,a) (5,b) (1,a) (7,a) (8,a)

The dictionary is stored in a tree structure; the edges are labeled with strings of characters. These strings are in the window, and each node of the tree contains a pointer into the window identifying the symbols on the path from the root to that node.

LZW (Welch), LZC
Only the indexes are output:
• the dictionary is initialized with items for all input symbols,
• the last symbol of each phrase is the first symbol of the following phrase.
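A minimal LZW sketch in the spirit of the description above (the dictionary preinitialized with the single symbols; indexes start at 1 to match the slides; names are mine):

```python
def lzw_encode(text, alphabet):
    """LZW: output dictionary indexes only."""
    dic = {ch: i + 1 for i, ch in enumerate(alphabet)}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dic:
            phrase += ch                        # extend the current phrase
        else:
            out.append(dic[phrase])
            dic[phrase + ch] = len(dic) + 1     # new phrase = old phrase + next symbol
            phrase = ch                         # next phrase starts with this symbol
    out.append(dic[phrase])
    return out

print(lzw_encode("ababcbababaaaaaa", "abc"))
# [1, 2, 4, 3, 5, 8, 1, 10, 11]
```

Note how the overlap rule shows up: each new dictionary entry ends with the symbol that begins the following phrase, so the decoder can rebuild the dictionary from the indexes alone.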
Input:  a  b  ab  c  ba  bab  a  aa  aaa
Index:  4  5  6   7  8   9    10 11
Output: 1  2  4   3  5   8    1  10  11

LZC: dictionary overflow ⇒ the next phrase is not inserted and the coding continues statically. It is LZW plus:
• pointers are encoded with increasing length,
• once the compression ratio decreases, the dictionary is deleted and built again from scratch.

LZT
As LZC, but when the dictionary overflows, the phrases least used in the recent past are excluded from the dictionary. It uses phased-in binary coding of the phrase indexes.

LZMW
As LZT, but a new phrase is not created by adding one character to the previous phrase; instead, the new phrase is constructed by concatenating the last two encoded ones.

LZJ
Another principle of dictionary construction:
• at the beginning, only the single symbols are inserted;
• the dictionary is stored in a tree and contains all substrings of the processed string of length up to h;
• full dictionary ⇒
  • static procedure, or
  • omitting of nodes with low usage frequency.

Dictionary restructuring
• Ongoing reorganization of source units → shorter code strings.
• Variants of heuristics: count of occurrences, move to the front (BSTW), exchange with the previous, move X positions forward.
• BSTW (advantage: high locality of occurrences of a small number of source units).
• Example: I'm not going to the forest, ..., a_1^n a_2^n ... a_k^n.
• Generalization: recency coefficient, interval coding.

Interval coding
A word is represented by the number of words since its last occurrence. The dictionary contains the words a_1, a_2, ..., a_n; the input sequence is x_1, x_2, ..., x_m. The value LAST(a_i), the position of the last occurrence, is initialized to zero.
for t := 1 to m do
begin {x_t = a_i}
  if LAST(x_t) = 0 then y_t := t + i − 1
                   else y_t := t − LAST(x_t);
  LAST(x_t) := t
end
The sequence y_1, y_2, ..., y_m (with y_t = t + i − 1 at the first occurrence of x_t = a_i, and y_t = t − LAST(x_t) otherwise) is the output of the encoder and can be encoded by one code of variable length.

Syntactic methods
• the grammar of the message language is known,
• the left parse of the derivation tree of the string,
• global numbering of the rules,
• local numbering of the rules,
• the decision states of an LR analyzer are encoded.

Context modeling
• fixed context - a model of order N.
• combined approach - contexts of various lengths, p(x) = Σ_{n=0}^{m} w_n p_n(x).
• w_n fixed or variable.
• time and memory consuming.
• assignment of a probability to a new source unit.
• finite-context automata,
• dynamic Markov modeling.

Checking the correctness of text
• Checking of the text using a frequency dictionary.
• Checking of the text using a double frequency dictionary.
• Interactive checking of the text (ispell).
• Checking of the text based on the regularity of words: the weirdness coefficient.

The weirdness coefficient of a trigram xyz is
KPT = [log(f(xy) − 1) + log(f(yz) − 1)]/2 − log(f(xyz) − 1),
where f(xy) and f(xyz) are the relative frequencies of the bigram and the trigram, respectively, and log(0) is defined as −10.
The weirdness coefficient of a word is
KPS = √(Σ_i (KPT_i − SKPT)²),
where KPT_i is the weirdness coefficient of the i-th trigram and SKPT is the mean of the weirdness coefficients of all trigrams contained in the word.

• Querying and TIS models.
• Boolean model of documents.
• Vector model of documents.
• TIS architecture.
• Signature methods.
• Similarity of documents.
• Vector model of documents (completion).
• Extended Boolean model.
• Probability model.
• Model of document clusters.
• TIS architecture.
• Automatic text structuring.
• Document similarity.
• Lexicon storage.
• Signature methods.

Different methods of hierarchization and document storage ⇒ different possibilities and efficiency of querying.
• Boolean model, SQL.
• Vector model.
• Extended Boolean types.
• Probability model.
• Model of document clusters.

The search lies in reducing the uncertainty of the question.
1. We find a document with high relevance.
2. We start to query with its key words.
3. We remove descriptors, or replace them with disjunctions.

Boolean model: querying
The system http://infomap.stanford.edu works with the searched meaning/concept (as opposed to mere strings of characters). The right query formulation is half of the answer. The search lies in determining the semantically closest terms.

• Fifties: representation of documents using sets of terms and querying based on the evaluation of Boolean expressions.
• Query expressions: built inductively from the primitives:
  • term,
  • attribute_name = attribute_value (comparison),
  • function_name(term) (application of a function),
and also using parentheses and the logical connectives X and Y, X or Y, X xor Y, not Y.
• disjunctive normal form, conjunctive normal form
• proximity operators
• regular expressions
• thesaurus usage

Boolean model
• Boolean operators: and, or, xor, not.
• positional operators: adj, (n) words, with, same, syn.
• SQL extension: operations/queries using a thesaurus:
BT(A) broader term
NT(A) narrower term
PT(A) preferred term
SYN(A) synonyms of the term A
RT(A) related term
TT(A) top term

ORACLE SQL*TEXTRETRIEVAL
SELECT specification_of_items
FROM specification_of_tables
WHERE item CONTAINS text_expression
Example:
SELECT TITLE FROM BOOK
WHERE ABSTRACT CONTAINS 'TEXT' AND RT(RETRIEVAL)
Patterns: 'string', 'string'*, *'string', 'st?ing', 'str%ing', 'stringa' (m,n) 'stringb', 'multiword phrases', BT('string',n), BT('string',*), NT('string',n).
Example:
SELECT NAME FROM EMPLOYEE
WHERE EDUCATION CONTAINS RT(UNIVERSITA)
AND LANGUAGES CONTAINS 'ENGLISH' AND 'GERMAN'
AND PUBLICATIONS CONTAINS 'BOOK' OR NT('BOOK',*)

Boolean model: association factor
asoc(Q_A, Q_B) = log10 [ (fN − AB − N/2)² N / (AB(N − A)(N − B)) ],
where
A - the number of documents "hit" by the query Q_A,
B - the number of documents "hit" by the query Q_B (whose relevance we compute),
f - the number of documents "hit" by both queries,
N - the total number of documents in the TIS.
cutoff (relevant/irrelevant), clustering/nesting, 1st generation, 2nd generation, ...

Vector model of documents: let a_1, ..., a_n be terms, D_1, ..., D_m documents, and W = (w_ij) a relevance matrix of type m, n, with w_ij ∈ ⟨0, 1⟩: 0 irrelevant, 1 relevant.
A query Q = (q_1, ..., q_n); S(Q, D_i) = Σ_j q_j w_ij is the similarity coefficient; head(sort(S(Q, D_i))) is the answer.
CONS: does not take "and"/"or" in the query into account.
PROS: possible improvements:
• normalization of weights,
• term frequency TF,
• inverted document frequency IDF = log2(m/k) for a term occurring in k of the m documents,
• distinction of terms,
• normalization of weights for the document and for the query [POK].

Automatic text structuring
• Interrelations between documents in a TIS.
• Encyclopedia (OSN, Funk and Wagnalls New Encyclopedia).
• [SBA] http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs
• Google/CiteSeer: "automatic structuring of text files".

Document similarity
• Most often the cosine measure - advantages.
• For a detailed overview of similarity functions see chapter 5.7 of [KOR] (similarity).

[MeM] Mohri, M.: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61-80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
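The cosine measure mentioned above, applied to rows of the vector model's relevance matrix, can be sketched as follows (names are mine):

```python
from math import sqrt

def cosine(q, d):
    """Cosine similarity of a query vector q and a document vector d."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = sqrt(sum(x * x for x in q)) * sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0

q  = [1, 1, 0, 0]          # query uses terms 1 and 2
d1 = [1, 1, 1, 0]          # document weight rows w_ij
d2 = [0, 0, 1, 1]
print(cosine(q, d1) > cosine(q, d2))   # True: d1 is ranked first
```

One advantage the slide alludes to: dividing by the vector norms makes the measure independent of document length, unlike the raw coefficient S(Q, D_i) = Σ_j q_j w_ij.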