PV030 Textual Information Systems
Petr Sojka, Faculty of Informatics, Masaryk University, Brno, Spring 2012

Part I
Information about the course PV030

• Petr Sojka, sojka@fi.muni.cz
• Consulting hours in Spring 2012: Wednesday 13:00–13:45, Friday 10:00–11:50, or write an email to arrange another time to meet.
• Room C523/522, fifth floor of block C, Botanická 68a.
• Course homepage: http://www.fi.muni.cz/~sojka/PV030/
• Seminar (Thu 12:00–12:50, C511 -> B511).

Topics and classification of the course
Prerequisites: students are expected to have basic knowledge of the theory of automata and formal languages (IB005) and elementary knowledge of the theories of complexity, software programming, and systems.
Classification: the classification is based on a written mid-term (30 points) and a final (70 points) exam. In addition, one can get premium points for activities during the lectures. The classification scale (changes reserved) z/k/E/D/C/B/A corresponds to obtaining 48/54/60/66/72/78/84 points. Dates of the final exams will be announced via IS.muni.cz (probably three terms).

"My books focus on timeless truth." (D. E. Knuth, Brno, 1996)

An emphasis will be given to the explanation of basic principles, algorithms, and (software) design techniques for the creation and implementation of textual information systems (TIS): storage and information retrieval.

Course syllabus
1. Basic notions. TIS (text information system). Classification of information systems. From texts to Watson.
2. Searching in TIS. Searching and pattern matching: classification and data structures.
3. Algorithms of Knuth-Morris-Pratt, Aho-Corasick; regular expressions.
4. Algorithms of Boyer-Moore, Commentz-Walter, Buczilowski.
5. Theory of automata for searching. Classification of searching problems. Searching with errors.
6. Indexes. Indexing methods. Data structures for searching and indexing.
7. Google as an example of a search and indexing engine. PageRank. Signature methods.
8. Query languages and document models: Boolean, vector, probabilistic, MMM, Paice.
9. Data compression. Basic notions. Entropy.
10. Statistical methods.
11. Compression methods based on dictionaries.
12. Syntactic methods. Context modeling. Language modeling. Corpus linguistics.
13. Spell checking. Filtering of information channels. Document classification. Neural nets for text compression.

Literature
[MEL] Melichar, B.: Textové informační systémy, skripta ČVUT Praha, 2. vydání, 1996.
[POK] Pokorný, J., Snášel, V., Húsek, D.: Dokumentografické informační systémy, Karolinum Praha, 1998.
[KOR] Korfhage, R. R.: Information Storage and Retrieval, Wiley Computer Publishing, 1997.
[SMY] Smyth, B.: Computing Patterns in Strings, Addison Wesley, 2003.
[KNU] Knuth, D. E.: The Art of Computer Programming, Vol. 3, Sorting and Searching, Second edition, 1998.
[WMB] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images, Second edition, Morgan Kaufmann Publishers, 1998.
[HEL] Held, G.: Data and Image Compression, Tools and Techniques, John Wiley & Sons, 4. vydání, 1996.
[MEH] Melichar, B., Holub, J.: A 6D Classification of Pattern Matching Problems, Proceedings of The Prague Stringology Club Workshop '97, Prague, July 7, 1997.
[GOO] Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine.
WWW7 / Computer Networks 30(1–7): 107–117 (1998). http://dbpubs.stanford.edu:8090/pub/1998-8
[MeM] Mehryar Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61–80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
[Sch] Schmidhuber, J.: Sequential Neural Text Compression, IEEE Transactions on Neural Networks 7(1), 142–146, 1996. http://www.idsia.ch/~juergen/onlinepub.html
[SBA] Salton, G., Buckley, Ch., Allan, J.: Automatic Structuring of Text Files, Electronic Publishing 5(1), pp. 1–17 (March 1992). http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs.htm
[WWW] Web pages of the course ~sojka/PV030/, DIS seminars http://www.inf.upol.cz/dis, http://nlp.fi.muni.cz/, The Prague Stringology Club Workshops 1996–2008, http://cs.felk.cvut.cz/psc/
[JW] Jones, S. K., Willett: Readings in Information Retrieval, Morgan Kaufmann Publishers, 1997.
[BCW] Bell, T. C., Cleary, J. G., Witten, I. H.: Text Compression, Prentice Hall, Englewood Cliffs, NJ, 1990.
[STO] Storer, J.: Data Compression: Methods and Theory, Computer Science Press, Rockville, 1988.
Journals: ACM Transactions on Information Systems, Theoretical Computer Science, Neural Network World, ACM Transactions on Computer Systems, Knowledge Acquisition. See also knihovna.muni.cz, umarecka.cz (textbook Pokorný).

Part II
Basic notions of TIS

reality <—> data
information need <—> query

• Abstractions and mappings in information systems.
• Information needs about the reality: queries above data.
• The Jeopardy game: Watson.
Notions of (T)IS; PV030 in the context of teaching at FI MU

Definition: An information system is a system that allows purposeful arrangement of the collection, storage, processing, and delivery of information.
Definition: The ectosystem consists of the users of the IS, the investor of the IS, and the entrepreneur (user, funder, server). In the example of is.muni.cz, these are the users of the IS, MU represented by the bursar, and the ICS and IS teams. The ectosystem is not under the control of the IS designer.
Definition: The endosystem consists of the hardware used (media, devices) and the software (algorithms, data structures), and it is under the control of the IS designer.

Views of an IS:
• effectiveness (user);
• economics (funder);
• efficiency (server);
and the compromises implied by these different preferences. Our view will be that of a TIS architect respecting the requests of the IS ectosystem. For topics related to the ectosystem of an IS, see PV045 Management IS.

From data to wisdom
• Data: a concrete representation of a message in the form of a sequence of symbols of an alphabet.
• Information: a reflection of the known or the expected substance of realities. Information depends on the intended subject. Viewpoints:
  • quantitative (information theory);
  • qualitative (meaning, semantics);
  • pragmatic (valuation: significance, usefulness, usability, periodicity, up-to-dateness, credibility);
  • others (promptness, particularity, completeness, univocality, availability, costs of obtaining).
• Knowledge (znalost).
• Wisdom (moudrost).
Definition: An information process is the process of the formation of information, its representation in the form of data, and its processing, provision, and use. Operations with information correspond to this process.
Data/signals → Information → Knowledge → Wisdom.

IS classification by the prevailing function
1. Information retrieval systems.
2. Database management systems (DBMS), relational DB (PB154, PB155, PV003, PV055, PV136, PB114).
3. Management information systems (PV045).
4. Decision support systems (PV098).
5. Expert systems, question answering systems, knowledge-based systems (PA031).
6. Information service systems (web 2.0).
7. Specific information systems (geographical PV019, PA049, PA050; medical PV048; environmental PV044; corporate PV043; state administration PV058, PV059; librarian PV070); and also PV063 Application of database systems.

Related fields taught at FI: Software engineering (PA102, PA105). Similarity searching in multimedia data (PA128). Efficient use of database systems (PA152). Introduction to information retrieval (PV211).

Mini questionnaire
1. What do you expect from this course? What was your motivation to enroll? Is the planned syllabus fine? Any changes or surprises?
2. What do you not expect (what would you rather eliminate)?
3. Which related courses have you already passed?
4. Practising IS usage (as a user): a) Which (T)IS do you use? b) With what intensity and frequency? How many searches per month? c) Are you satisfied with it?
5. IS creation (server): a) Which (T)IS or its components have you realized? In which area, of what size? b) Are you satisfied with it? What were the bottlenecks?

Classification and formalization of IRS

Information retrieval systems (IRS): principles
Query → search engine (over a DB of documents) → set of selected documents.
The search engine is determined by the definition of the documents, the definition of the output file format, the definition of the search method, and the content of the documents; queries are evaluated against it to produce the set of selected documents.

Searching: formalization of the problem
Concatenation: a string of beads. A bead → an element. Elements are indexed by natural numbers (or, more generally, by labels):
0) Every element has a unique label.
1) Every labeled element x (except for the leftmost one) has a unique predecessor, referred to as pred(x).
2) Every labeled element x (except for the rightmost one) has a unique successor, referred to as succ(x).
3) If the element x is not the leftmost one, then x = succ(pred(x)).
4) If the element x is not the rightmost one, then x = pred(succ(x)).
5) For every two different elements x and y, there exists a positive number k such that either x = succ^k(y) or x = pred^k(y).

The concatenation term:
Definition: a string is a set of elements that meets the rules 0)–5).
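As an illustration of ours (not from the lecture), the axioms 3) and 4) can be checked directly on a doubly linked representation of a string in C; the names Elem, pred, and succ below are our choices:

```c
#include <stddef.h>

/* Hypothetical sketch: an element ("bead") with explicit predecessor and
   successor links; the struct and field names are ours. */
typedef struct Elem {
    char label;
    struct Elem *pred, *succ;
} Elem;

/* Build the linear string a-b-c and verify axioms 3) and 4) for the
   middle element: x = succ(pred(x)) and x = pred(succ(x)). */
int axioms_hold(void) {
    static Elem a = {'a', NULL, NULL}, b = {'b', NULL, NULL},
                c = {'c', NULL, NULL};
    a.succ = &b; b.pred = &a;   /* a <-> b */
    b.succ = &c; c.pred = &b;   /* b <-> c */
    return (&b == b.pred->succ) && (&b == b.succ->pred);
}
```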
Definition: a linear string is a string that has finitely many elements, including the leftmost and the rightmost ones.
Definition: a necklace (a circular string).
Definition: an alphabet A; the letters of the alphabet; A+; the empty string ε.
Definition: A* = A+ ∪ {ε}.
Definition: a linear string over A is a member of A+.
Definition: a pattern. A text.

Classification of searching:
1. According to the passing direction: left-to-right / right-to-left.
2. According to the (pre)processing of the text and the pattern:
• ad fontes (searching in the text itself);
• text surrogate (searching in a substitute of the text);
• substitutes: an index (an ordered list of significant elements together with references to the original text); a signature (a string of indicators that shows the occurrence of significant elements in the text).

IRS classification (cont.)

                          text preprocessing
                          no    yes
pattern preprocessing no   I    III
                      yes  II   IV

I – elementary algorithms; II – creating a search engine; III – indexing methods; IV – signature methods.

Searching: the formulation of the problem
Classification according to the cardinality of the set of patterns:
1. Search for a single pattern v in the text t. The result: yes/no.
2. Search for a finite set of patterns P = {v1, v2, ..., vk}. The result: information about the positions of some of the entered patterns.
3. Search for an infinite set of patterns given by a regular expression R. R defines a potentially infinite set L(R). The result: information about the positions of some of the patterns from L(R).
Alternatives to the formulation of the searching problem:
a) the first occurrence;
b) all occurrences without overlapping;
c) all occurrences including overlapping.

Part III
Exact search

I. Search without preprocessing of either the patterns or the text

Rudimentary search algorithm
Naive search, brute-force search:

proc Brute-Force-Matcher(PATTERN, TEXT):
  T := length[TEXT]; P := length[PATTERN];
  for i := 0 to T-P do
    if PATTERN[1..P] = TEXT[i+1..i+P] then
      print "The pattern was found at the position i.";

Time complexity analysis of the naive search
• The complexity is measured by the number of comparisons, with the length of the pattern P and the length of the text T.
• The upper estimate is S = P · (T − P + 1), thus O(P × T).
• The worst case: PATTERN = a^(P−1)b, TEXT = a^(T−1)b.
• Natural languages: the (average) complexity (number of comparisons) is substantially smaller, since equality of prefixes does not occur very often. For English, S = C_E · (T − P + 1), with C_E empirically measured as 1.07, i.e., practically linear.
• What is C_Cz? C_Cz vs. C_E?
• Any speedups? An application to several patterns? To an infinite number of patterns?
• We will see another version of the algorithm in the seminar.

Exercise: express the time complexity of the following search algorithms using the variables c and s, where c is the number of the tests and these statements hold:
• if the index i is found, then c = i and s = 1;
• otherwise, c = T and s = 0.
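The Brute-Force-Matcher pseudocode above can be sketched as a runnable C function (a sketch of ours, with 0-based indexing; the position-wise comparison is done by memcmp):

```c
#include <string.h>

/* Naive (brute-force) search: try every position i of the text and
   compare the whole pattern there. Worst case O(P * T) comparisons,
   as in the analysis above. Returns the first 0-based position of the
   pattern in the text, or -1 if there is no occurrence. */
int brute_force_match(const char *text, const char *pattern) {
    int T = (int)strlen(text), P = (int)strlen(pattern);
    for (int i = 0; i <= T - P; i++)
        if (memcmp(text + i, pattern, (size_t)P) == 0)
            return i;               /* the pattern was found at position i */
    return -1;                      /* no occurrence */
}
```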
input: var TEXT: array[1..T] of word; PATTERN: word;
output (in the variable FOUND): yes/no

I := 1;                              {1}
while I <= T do begin                {c}
  if TEXT[I] = PATTERN then break;   {c}
  inc(I);                            {c-s}
end;
FOUND := (I <= T);                   {1}

Higher effectiveness of the algorithms is needed.

II. Exact search with query preprocessing
During the preprocessing, the structure of the query pattern(s) is examined and, on that basis, the search engine is built (on the fly).
Definition: exact (vs. fuzzy (proximitní)) search aims at an exact match (localization of the searched pattern(s)).
Definition: left-to-right (LR, sousměrné) (vs. right-to-left (RL, protisměrné)) search compares the query pattern to the text from left to right (vs. from right to left).

1. One query pattern (vzorek):
• the Shift-Or algorithm;
• the Karp-Rabin algorithm (KR, 1987);
• the Knuth-Morris-Pratt algorithm (KMP, designed (MP) in 1970, published in 1977).
2. n patterns: the Aho-Corasick algorithm (AC, 1975).
3. ∞ patterns: construction of a search engine (finite automaton) for the search of a potentially infinite set of patterns (given as a regular expression).

The Shift-Or algorithm
• A pattern v1 v2 ... vm over an alphabet Σ = {a1, ..., ac}.
• An incidence matrix X (m × c), X_ij = 0 if v_i = a_j, 1 otherwise.
• The column of X corresponding to a_j is named A_j.
• At the beginning, we put a unit vector (all ones) into the state column R. In every step of the algorithm, R moves down by one position, the topmost position is filled with zero, and one character a_j is read from the input. The resulting R is combined with A_j by binary disjunction: R := SHIFT(R) OR A_j.
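The step R := SHIFT(R) OR A_j maps directly onto machine words. The following C sketch (ours) stores each column A_j as a bit mask, with bit i equal to 0 iff v_{i+1} matches the symbol, and assumes the pattern fits into one unsigned long:

```c
#include <limits.h>
#include <string.h>

/* Bit-parallel Shift-Or. mask[a] plays the role of the column A_j of the
   incidence matrix X (a 0 bit means a match); R is the state column.
   Returns the 0-based position of the first occurrence, or -1. */
int shift_or_match(const char *text, const char *pattern) {
    int T = (int)strlen(text), P = (int)strlen(pattern);
    if (P == 0 || P > (int)(sizeof(unsigned long) * CHAR_BIT))
        return -1;                      /* pattern must fit in one word */
    unsigned long mask[256];
    for (int c = 0; c < 256; c++)
        mask[c] = ~0UL;                 /* 1 = mismatch everywhere */
    for (int i = 0; i < P; i++)
        mask[(unsigned char)pattern[i]] &= ~(1UL << i);  /* 0 = match */
    unsigned long R = ~0UL;             /* the initial all-ones vector */
    for (int i = 0; i < T; i++) {
        R = (R << 1) | mask[(unsigned char)text[i]];  /* R := SHIFT(R) OR A_j */
        if ((R & (1UL << (P - 1))) == 0)              /* 0 at the bottom */
            return i - P + 1;
    }
    return -1;
}
```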
• The algorithm stops successfully when 0 appears at the bottommost position of R.

Example: v = vzorek over Σ = {e, k, o, r, v, z}. Cf. [POK, pages 31–32].

The Karp-Rabin search algorithm
A quite different approach: the usage of a hash function. Instead of matching the pattern against the text at every position, we check for a match only when the pattern "looks similar" to the searched text substring. For the similarity check, a hash function is used. It has to be
• efficiently computable,
• and good at separating different strings (close to perfect hashing).
KR search is quadratic in the worst case, but O(T + V) on average.

Implementation (in C; the text is y of length n, the pattern is x of length m):

#define REHASH(a, b, h) ((((h) - (a)*d) << 1) + (b))

void KR(char *y, char *x, int n, int m) {
  int hy, hx, d, i;
  /* preprocessing: computation of d = 2^(m-1) */
  d = 1;
  for (i = 1; i < m; ++i)
    d <<= 1;
  /* initial hash values of the pattern and of the first window */
  for (hy = hx = i = 0; i < m; ++i) {
    hx = (hx << 1) + x[i];
    hy = (hy << 1) + y[i];
  }
  /* searching: roll the window hash along the text */
  for (i = m; i <= n; ++i) {
    if (hx == hy && memcmp(x, y + i - m, m) == 0)
      printf("The pattern was found at the position %d.\n", i - m);
    if (i < n)
      hy = REHASH(y[i - m], y[i], hy);
  }
}

The KMP algorithm (main part):

i := 1;  { text index }
j := 1;  { pattern index }
while (i <= T) and (j <= V) do
begin
  while (j > 0) and (text[i] <> pattern[j]) do j := h[j];
  i := i + 1; j := j + 1
end;
found := j > V;  { if found, it is at the position i - V }

(K)MP search engine
• O(T) complexity plus the complexity of the preprocessing (the creation of the array h).
• Animation of the tracing of the main part of KMP.
is" h(j) the biggest k < j such that 1/2 • • • ^-1 's suffix of ^2 • • • ^-1, e.g. ^2 • • • ^-1 = Vj-k+2 ■ ■ ■ Vj-A and Vj ± vk. is" KMP: backward transitions for so long, so that j = 0 (prefix of pattern is not contained in the searched text) or t; = Vj [Vy\V2...Vj = tj-j+A ti-j+2 ■ ■ ■ 1;). is" Animation Lecroq, also [POK, page 27], also see [MAR] for detailed description. Petr Sojka FV030 Textual Information Systems (K)MF Search engine (finite automaton) Construction of the KMF engine i:=l; j:=0; h[l]:=0; while (i0) and (v[i]<>v[j]) do j:=h[j]; i:=i+l; j:=j+l; if (i<=V) and (v[i]=v[j]) then h[i]:=h[j] else h[i] :=j (*MP*) end; Complexity of h computation, e.g. preprocessing, is 0{V), thus in total 0(T + V). Example: h for ababa. KMPvs. MP. Fetr Sojka FV030 Textual Information Systems (K)MF Search engine (finite automaton) Construction of the KMF engine that uses transition table q derived from the searched pattern, (g relates to the transition function 6 of FA): var i,T:integer; found: boolean; text: array[l..T] of char; state,qO: TSTATE; g:array[1..maxstate,1..maxsymb] of TSTATE; F: set of TSTATE;... begin found:= FALSE; state:= qO; i:=0; while (i <= T) and not found do begin i:=i+l; state:= g[state,text [i]]; found:= state in F; end; end; How to transform pattern into g? Fetr Sojka FV030 Textual Information Systems (K)MP Search engine (finite automaton) Construction of the KMP engine -right search is- 5E for left-to-right search A = (Q, T, g, h, q0, F) 9 Q is a finite set of states. • T is a finite input alphabet. • g: Q x T —> QU {fail,} is a forward state-transition function. • h: [Q — qo) —> Qis a backward state-transition function. • qo is an initial state. • F is a set of final states. is" A depth of the state q: d(q) £ No is a length of the shortest forward sequence of the state transitions from qo to q. 
• Characteristics of g and h:
  • g(q0, a) ≠ fail for every a ∈ T (there is no backward transition in the initial state).
  • If h(q) = p, then d(p) < d(q) (the total number of the backward transitions is therefore bounded from above by the number of the forward transitions). So the speed of searching is linear.

• An SE configuration is (q, w), q ∈ Q, w ∈ T* the not-yet-searched part of the text.
• The initial configuration of the SE is (q0, w), where w is the entire searched text.
• An accepting configuration of the SE is (q, w), q ∈ F, w the not-yet-searched text; the found pattern is immediately before w.
• An SE transition is a relation ⊢ ⊆ (Q × T*) × (Q × T*):
  • if g(q, a) = p, then (q, aw) ⊢ (p, w) (a forward transition), for every w ∈ T*;
  • if h(q) = p, then (q, w) ⊢ (p, w) (a backward transition), for every w ∈ T*.
During a forward transition, a single input symbol is read and the engine switches to the next state p. However, if g(q, a) = fail, a backward transition is executed without reading an input symbol. S = O(T) (we measure the number of the SE transitions).

Construction of the KMP SE for the pattern v1 v2 ... vV
1. An initial state q0.
2. g(q, v_{j+1}) = q', where q is equivalent to the prefix v1 v2 ... v_j and q' to the prefix v1 v2 ... v_j v_{j+1}.
3. For q0, we define g(q0, a) = q0 for every a for which g(q0, a) has not been defined in the previous step.
4. g(q, a) = fail for every q and a for which g(q, a) has not been defined in the previous steps.
5. The state that corresponds to the complete pattern is the final one.
6. The backward state-transition function h is defined by the algorithm mentioned below.

Overview of this lecture:
1. Summary of the previous lecture; searching with an SE.
2. Algorithms for the left-to-right search of n patterns (AC, NFA → DFA).
3. Algorithms for the left-to-right search of infinite sets of patterns.
4. Regular expressions (RE).
5. Direct construction of an (N)FA for a given RE.

Part V
Search of a finite set of patterns

SE for the left-to-right search of a set of patterns P = {v^1, v^2, ..., v^P}: instead of a repeated search of the text for every pattern, there is only "one" pass (FA).

var text: array[1..T] of char;
    i: integer; found: boolean;
    state: tstate;
    g: array[1..maxstate, 1..maxsymbol] of tstate;
    h: array[1..maxstate] of tstate;
    F: set of tstate;
found := false; state := q0; i := 0;
while (i <= T) and not found do begin
  i := i + 1;
  while g[state, text[i]] = fail do state := h[state];
  state := g[state, text[i]];
  found := state in F
end

• How do we construct the state-transition functions h and g?
• How does it work for P patterns? What is the main idea?
• Aho, Corasick, 1975 (the AC search engine).

Construction of g for the AC SE for a set of patterns P = {v^1, v^2, ..., v^P}
1. An initial state q0.
2. g(q, b_{j+1}) = q', where q is equivalent to the prefix b1 b2 ... b_j and q' to the prefix b1 b2 ... b_{j+1} of the pattern v^i, for every i ∈ {1, ..., P}.
3. For q0, we define g(q0, a) = q0 for every a for which g(q0, a) has not been defined in the previous steps.
4. g(q, a) = fail for every q and a for which g(q, a) has not been defined in the previous steps.
5. Each state that corresponds to a complete pattern is a final one.
An example: P = {he, she, her} over T = {h, e, r, s, x}, where x stands for anything other than {h, e, r, s}.

Construction of h for the AC SE for a set of patterns P = {v^1, v^2, ..., v^P}
At first, we define the failure function f inductively, relative to the depth of the states, this way:
1. For every state q of the depth 1, f(q) = q0.
2. Let us assume that f is defined for each state of the depth d and lesser. Let q_p denote a state of the depth d with g(q_p, a) = q'. Then we compute f(q') as follows:
   q := f(q_p);
   while g(q, a) = fail do q := f(q);
   f(q') := g(q, a);

• The cycle terminates, since g(q0, a) ≠ fail.
• If the states q and r represent prefixes u and v of some of the patterns from P, then f(q) = r iff v is the longest proper suffix of u.

• We could use f as the backward state-transition function h; however, redundant backward transitions would be performed.
• We therefore define the function h inductively, relative to the depth of the states, this way:
  • For every state q of the depth 1, h(q) = q0.
  • Let us assume that h is defined for each state of the depth d and lesser. Let the depth of q be d + 1. If the set of letters for which the value of g in the state f(q) differs from fail is a subset of the set of letters for which the value of g in the state q differs from fail, then h(q) := h(f(q)); otherwise h(q) := f(q).
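The whole AC construction above (the goto function g as a trie, the failure function f computed by increasing depth, and the completion of g so that no fail transitions remain) can be condensed into a small C sketch; the fixed limits, the global arrays, and the "first match only" interface are our simplifications:

```c
#include <string.h>

enum { MAXS = 64, SIGMA = 256 };        /* small fixed limits (our choice) */
static int g_[MAXS][SIGMA];             /* goto function g (state 0 = q0) */
static int f_[MAXS];                    /* failure function f */
static int out_[MAXS];                  /* 1 iff the state is final */
static int nstates;

/* Steps 1-2: build the trie of all pattern prefixes; then compute f
   breadth-first (by increasing depth) and complete g, so that the
   search never needs explicit backward transitions. */
void ac_build(const char **pats, int np) {
    memset(g_, 0, sizeof g_); memset(f_, 0, sizeof f_);
    memset(out_, 0, sizeof out_); nstates = 1;
    for (int p = 0; p < np; p++) {
        int q = 0;
        for (const char *s = pats[p]; *s; s++) {
            unsigned char a = (unsigned char)*s;
            if (!g_[q][a]) g_[q][a] = nstates++;
            q = g_[q][a];
        }
        out_[q] = 1;                    /* state of a complete pattern */
    }
    int queue[MAXS], head = 0, tail = 0;
    for (int a = 0; a < SIGMA; a++)     /* depth-1 states: f(q) = q0 */
        if (g_[0][a]) queue[tail++] = g_[0][a];
    while (head < tail) {
        int q = queue[head++];
        for (int a = 0; a < SIGMA; a++) {
            int q2 = g_[q][a];
            if (q2) {
                f_[q2] = g_[f_[q]][a];   /* longest proper suffix state */
                out_[q2] |= out_[f_[q2]];
                queue[tail++] = q2;
            } else {
                g_[q][a] = g_[f_[q]][a]; /* complete g: no fail left */
            }
        }
    }
}

/* Return the 0-based end position of the first match, or -1. */
int ac_search(const char *text) {
    int q = 0;
    for (int i = 0; text[i]; i++) {
        q = g_[q][(unsigned char)text[i]];
        if (out_[q]) return i;
    }
    return -1;
}
```

For P = {he, she, her}, scanning "xsherx" stops at the final e of she: the state for she is final, and its failure state he is final as well.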
Finite automata for searching
Definition: A deterministic finite automaton (DFA) is M = (K, T, δ, q0, F):
1. K is a finite set of inner states.
2. T is a finite input alphabet.
3. δ is a mapping from K × T to K.
4. q0 ∈ K is an initial state.
5. F ⊆ K is a set of final states.

• A completely specified automaton: δ is defined for every pair (q, a) ∈ K × T; otherwise it is an incompletely specified automaton.
• A configuration of M is a pair (q, w), where q ∈ K and w ∈ T* is the not-yet-searched part of the text.
• The initial configuration of M is (q0, w), where w is the entire text to be searched.
• An accepting configuration of M is (q, w), where q ∈ F and w ∈ T*.

During a transition, a single input symbol is read and the automaton switches to the next state p.
• A transition of M is defined by a state and an input symbol; it is a relation ⊢ ⊆ (K × T*) × (K × T*); if δ(q, a) = p, then (q, aw) ⊢ (p, w) for every w ∈ T*.
• The k-th power, the transitive closure, and the transitive-reflexive closure of the relation ⊢: ⊢^k, ⊢^+, ⊢^*.
• L(M) = {w ∈ T* : (q0, w) ⊢^* (q, w') for some q ∈ F, w' ∈ T*} is the language accepted by the FA M.
• The time complexity is O(T) (we measure the number of the transitions of the FA M).

Definition: A nondeterministic finite automaton (NFA) is M = (K, T, δ, q0, F) with δ: K × T → 2^K; δ(q, a) is now a set of states.
Definition: a transition ⊢ ⊆ (K × T*) × (K × T*): if p ∈ δ(q, a), then (q, aw) ⊢ (p, w) for every w ∈ T*.
Definition: a final state and L(M) are defined analogously as for a DFA.
Construction of an SE (DFA) from an NFA
Theorem: for every nondeterministic finite automaton M = (K, T, δ, q0, F), we can build a deterministic finite automaton M' = (K', T, δ', q0', F') such that L(M) = L(M').

A constructive proof (the algorithm):
Input: a nondeterministic FA M = (K, T, δ, q0, F).
Output: a deterministic FA M'.
1. K' = {{q0}}; the state {q0} is unmarked.
2. If all the states in K' are marked, continue with step 4.
3. Choose an unmarked state q' from K':
   • δ'(q', a) = ∪ {δ(p, a)} for every p ∈ q' and a ∈ T;
   • K' = K' ∪ δ'(q', a) for every a ∈ T;
   • mark q' and continue with step 2.
4. The final states of M' are those q' ∈ K' that contain some state of F; the initial state of M' is {q0}.

Construction of δ' for an SE for a set of patterns P = {v^1, v^2, ..., v^P}
1. We create an NFA M:
   • An initial state q0.
   • For every a ∈ T, we define q0 ∈ δ(q0, a) (a self-loop in the initial state).
   • For every i ∈ {1, ..., P}, we define δ(q, b_{j+1}) ∋ q', where q is equivalent to the prefix b1 ... b_j and q' to the prefix b1 ... b_{j+1} of the pattern v^i.
   • The state corresponding to the entire pattern is a final one.
2. ... and then we construct its corresponding DFA M' with δ'.

Part VI
Search for an infinite set of patterns

Definition: a regular expression E over the alphabet A:
1. ε and ∅ are RE, and a is a RE for every a ∈ A.
2. If x and y are RE over A, then:
   • (x + y) is a RE (union);
   • (x.y) is a RE (concatenation);
   • (x)* is a RE (iteration).
A convention about the priority of the regular operations: union < concatenation < iteration.
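The core of the subset construction above, δ'(q', a) = ∪ δ(p, a) over p ∈ q', is easy to run with state sets stored as bit masks. The tiny NFA below is our illustrative example (it accepts every text containing ab, with a self-loop in q0), not one from the lecture:

```c
/* NFA state i is represented by bit i of an unsigned value. */
enum { NSYM = 2, NQ = 3 };              /* alphabet {a = 0, b = 1} */
static const unsigned delta_nfa[NQ][NSYM] = {
    /* q0: */ { 0x3, 0x1 },             /* on a: {q0,q1}; on b: {q0} */
    /* q1: */ { 0x0, 0x4 },             /* on b: {q2} */
    /* q2: */ { 0x4, 0x4 },             /* accepting sink */
};

/* One deterministic step on a set of NFA states: the union of delta
   over all members of the set; each reachable set is one DFA state. */
unsigned dfa_step(unsigned set, int sym) {
    unsigned next = 0;
    for (int q = 0; q < NQ; q++)
        if (set & (1u << q))
            next |= delta_nfa[q][sym];
    return next;
}
```

Starting from {q0} (mask 0x1) and reading a, a, b leads to a set containing q2 (bit 0x4), i.e., an accepting DFA state; enumerating all masks reachable this way yields exactly the K' of the algorithm.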
Definition: thereafter, we consider as (generalized) regular expressions even those terms that do not contain, with regard to this convention, the unnecessary parentheses.

The value h(E) of a regular expression E:
• h(∅) = ∅, h(ε) = {ε}, h(a) = {a};
• h(x + y) = h(x) ∪ h(y); h(x.y) = h(x).h(y); h(x*) = (h(x))*;
• h(x*) = {ε} ∪ h(x) ∪ h(x.x) ∪ h(x.x.x) ∪ ...
• The value of a RE is a regular language (RL).
• Every RL can be represented by a RE.
• For every RE V there exists a FA M such that h(V) = L(M).

Axioms of regular expressions:
A1: x + (y + z) = (x + y) + z (associativity of union)
A2: x.(y.z) = (x.y).z (associativity of concatenation)
A3: x + y = y + x (commutativity of union)
A4: (x + y).z = x.z + y.z (right distributivity)
A5: x.(y + z) = x.y + x.z (left distributivity)
A6: x + x = x (idempotence of union)
A7: ε.x = x (identity element for concatenation)
A8: ∅.x = ∅ (zero element for concatenation)
A9: x + ∅ = x (identity element for union)
A10: x* = ε + x*.x
A11: x* = (ε + x)*

Overview of this lecture:
1. Summary of the previous lecture.
2. Regular expressions, the value of a RE, characteristics.
3. Derivation of regular expressions.
4. Direct construction of an equivalent DFA for a given RE by derivation.
5. Derivation of regular expressions by the position vector.
6. Right-to-left search (BMH, CW, BUC).

Theorem: this axiomatization of RE is complete and consistent.
Definition: regular expressions are termed similar if they can be mutually converted using the axioms A1 to A11.
Theorem: similar regular expressions have the same value.
Definition: the length d(E) of the regular expression E:
1. If E consists of one symbol, then d(E) = 1.
2. d(V1 + V2) = d(V1) + d(V2) + 1.
3. d(V1.V2) = d(V1) + d(V2) + 1.
4. d(V*) = d(V) + 1.
5. d((V)) = d(V) + 2.
Note: the length corresponds to the syntax of the regular expression.

Construction of an NFA for a given RE

Definition: a generalized NFA allows ε-transitions (transitions without reading an input symbol).
Theorem: for every RE E we can create an FA M such that h(E) = L(M).
Proof: by structural induction relative to the RE E.
(Figures: the base automata for ε, ∅ and a, and the constructions for union, concatenation and iteration.)

Properties of the constructed automaton M for E:
• No more than two edges come out of any state.
• No edges come out of the final states.
• The number of states of M ≤ 2 · d(E).
• The simulation of the automaton M is performed in O(d(E) · |T|) time and O(d(E)) space, where T is the text.
For the following methods of NFA simulation, we must remove the ε-transitions.
We can achieve it with the well-known procedure (ε-closure construction).

Simulation of an NFA

We represent a set of states with a Boolean vector and pass through all the paths at the same time. There are two approaches:
• a general algorithm that uses a transition table;
• implementation of the automaton in the form of a (generated) program for the particular automaton.

Construction of an NFA for a given RE by numbering of positions: let E be a RE over the alphabet T. We number the occurrences of symbols in E, obtaining E' over the alphabet Z of numbered symbols, and create FA M = (K, T, δ, q0, F):
1. A set of states K = {q0} ∪ Z.
2. A transition function δ: δ(q0, x) contains the x' ∈ Z that originate from the numbering of x and can begin a string of h(E'); δ(y, x) contains the x' ∈ Z originating from x that can follow y in a string of h(E').
3. F = {y ∈ Z : a string of h(E') can end with the symbol y} (plus q0 if ε ∈ h(E)).

Construction of the CW search engine
Input: a set of patterns P = {v1, ..., vk}. Output: the CW search engine.
Method: we construct the function g and introduce the evaluation w of the individual states:
1. An initial state q0; w(q0) = ε.
2. Each state of the search engine corresponds to a suffix bm bm+1 ... bn of a pattern vi of the set P. Let us define g(q, a) = q', where q corresponds to the suffix bm ... bn and q' to the suffix a bm ... bn of a pattern vi; w(q) = bn ... bm+1 bm; w(q') = w(q)a.
3. g(q, a) = fail for every q and a for which g(q, a) was not defined in step 2.
4. Each state that corresponds to a full pattern is a final one.

Definition: shift[STATE, TEXT[i − j]] = min{A, shift2(STATE)}, where A = max{shift1(STATE), char(TEXT[i − j]) − j − 1}. The functions are defined this way:
1. char(a) is defined for all the symbols of the alphabet T as the least depth of a state to which the CW search engine passes through the symbol a. If the symbol a is not in any pattern, then char(a) = LMIN + 1, where LMIN is the length of the shortest pattern. Formally: char(a) = min{LMIN + 1, min{d(q) | w(q) = xa, x ∈ T*}}.
2. Function shift1(q0) = 1; for the other states the value is shift1(q) = min{LMIN, A}, where A = min{k | k = d(q') − d(q), where w(q) is a proper suffix of w(q') and the state q' has a greater depth than q}.
3. Function shift2(q0) = LMIN; for the other states the value is shift2(q) = min{A, B}, where A = min{k | k = d(q') − d(q), where w(q) is a proper suffix of w(q') and q' is a final state}, and B = shift2(q''), where q'' is the predecessor of q.

Example: P = {cacbaa, aba, acb, acbab, ccbab}, LMIN = 3.
char: a → 1, b → 1, c → 2, every other symbol → 4 (= LMIN + 1).
(Table of shift1/shift2 values for the states ε, a, b, aa, ab, bc, ba, aab, aba, bca, bab, aabc, babc, aabca, babca, babcc, aabcac; for example shift1(aa) = 3, shift2(aa) = 2.)

Outline:
• Right-to-left search of one pattern.
• Right-to-left search of an infinite set of patterns.
• Two-way jump automaton - a generalization of the left-to-right and right-to-left algorithms learned so far.
• Hierarchy of the exact search engines.

Part VIII: Search for an infinite set of patterns

Right-to-left search for an infinite set of patterns

Definition: a reversed regular expression is created by reversing all the concatenations in the expression.
Example: the reversed RE for E = bc(a + a*bc) is E^R = (a + cba*)cb.

Right-to-left search for an infinite set of patterns (cont.)
Buczilowski: to search for E, we create E^R and use it for the determination of shift[STATE, SYMBOL] for each state and undefined transition, analogically as in the CW algorithm. (Example shift table for the states 0-5 over the symbols a, b, c and the other symbols X.)

Definition: a two-way deterministic finite automaton with jumps (2DFAS) is M = (Q, T, δ, q0, F), whose transition function prescribes, besides the next state, also the movement of the head: δ(q, a) = (q', −1) realizes a right-to-left comparison, while δ(q, a) = (q', m) with m > 1 realizes a jump to the position t = min{i + m, n + 1}; jumps may be performed both left-to-right and right-to-left. The 2DFAS generalizes the so far learned left-to-right and right-to-left algorithms.

Fuzzy search: metrics

Definition: a metric on a set X is a function d with the properties:
1. d(x, y) ≥ 0 and d(x, x) = 0;
2. d(x, y) = d(y, x) (symmetry);
3. d(x, y) = 0 ⇔ x = y (identity of indiscernibles);
4. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
We call the values of the function d distances.

Definition: let us have strings X and Y over the alphabet Σ. The minimal number of editing operations for the transformation of X to Y is
• the Hamming distance (R-distance), when we allow just the operation Replace;
• the Levenshtein distance (DIR-distance), when we allow the operations Delete, Insert and Replace;
• the generalized Levenshtein distance (DIRT-distance), when we allow the operations Delete, Insert, Replace and Transpose; transposition is possible at the neighbouring characters only.
They are metrics; the Hamming distance is defined over strings of the same length only, the Levenshtein distance also over different lengths.
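The R-distance and DIR-distance just defined can be computed directly; the following is a minimal illustrative sketch (not code from the course; names and encoding are our own), using the standard dynamic programming for DIR.

```python
# Sketch of the R-distance (Hamming) and DIR-distance (Levenshtein).

def hamming(x, y):
    """R-distance: Replace only; defined for strings of equal length."""
    if len(x) != len(y):
        raise ValueError("R-distance requires strings of the same length")
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y):
    """DIR-distance: minimal number of Delete/Insert/Replace operations."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]
```

Note that DIR(X, Y) ≤ R(X, Y) for strings of equal length, since every Replace script is also a DIR script.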
Example: find such an example of strings X and Y that R(X, Y) = 5, DIR(X, Y) = 5 and DIRT(X, Y) = 5 hold simultaneously, or prove the non-existence of such strings.
Example: find such an example of strings X and Y that R(X, Y) = 5, DIR(X, Y) = 4 and DIRT(X, Y) = 3 hold simultaneously, or prove the non-existence of such strings.
Example: find such an example of strings X and Y of the length 2n, n ∈ N, that R(X, Y) = 2n and a) DIR(X, Y) = 2; b) DIRT(X, Y) = …

Classification of search problems

Definition: let T = t1 t2 ... tn be a text and P = p1 p2 ... pm a pattern. For example, we can ask:
1. Is P a substring of T?
2. Is P a subsequence of T?
3. Is a substring or a subsequence P in T?
4. Is P in T such that D(P, X) ≤ k for k < m, where X = ti ... tj is a part of T (D is R, DIR or DIRT)?
5. Is a string P containing the don't-care symbol ∗ in T?
6. Is a sequence of patterns P in T?
Furthermore, there are the variants for more patterns, plus the instances of the search problem: yes/no, the first occurrence, all the overlapping occurrences, all the non-overlapping occurrences.

6D classification of search problems

Dimension: 1    2    3    4    5    6
           S    F    O    E    C    O
           Q    S    F    R    D    S
                     I    D
                          G

In total 2 × 2 × 3 × 4 × 2 × 2 = 192 search problems classified in a six-dimensional space. For example, SFO??? denotes all the SE for the search of one (entire) string. For all these problems, we are going to learn how to create an NFA for searching.

Example: let P = p1 p2 p3 ... pm, m = 4; A denotes any character of Σ.
(Figure: NFA for SFOECO, with a self-loop under A in the initial state.)

Search for a sequence of characters
Example: NFA for QFOECO (seQuence): p̄ denotes any character of Σ except p. The automaton has m + 1 states for a pattern of the length m.

Example: the NFA for QSOECO is similar; we just add cycles for non-matching characters and ε-transitions to all the existing forward transitions (or we concatenate the automaton m times).
Definition: the automaton for QSOECO is called an ε-trellis.

Definition: the automaton for searching with R-distance at most k is called an R-trellis and has (m + 1) + m + (m − 1) + ... + (m − k + 1) = (k + 1)(m + 1 − k/2) states. The number of the level of the final state corresponds to the length of the found string from the pattern.

SFOGCO
For the DIRT-distance, we add to the SFODCO automaton more new states that correspond to the operation of transposition, and also the corresponding pair of edges for every transposition.

The animation program by Mr. Pojer for the discussed search automata is available for download from the course web page and is also installed.
Simulation of the NFA, or determinisation? A hybrid approach.
The Prague Stringology Club and its conference series: see http://www.stringology.org/.

Outline:
• Searching with text preprocessing; indexing methods.
• Methods of indexing.
• Automatic indexing, thesaurus construction.
• Ways of index implementation.

Part X: Indexing Methods

Large amounts of texts? Text preprocessing!
• Index, indexing methods, index file, index-sequential file.
• Hierarchical structuring of text, tagging of text, hypertext.
• Questions of word list (lexicon) storage and occurrence (hit) list storage, and of their updating.

Granularity of the index items: document - paragraph - sentence - word.

        word1 word2 word3 word4
doc1      1     1     0     1
doc2      0     1     1     1
doc3      1     0     1     1

Inverted file (transposition):

        doc1  doc2  doc3
word1     1     0     1
word2     1     1     0
word3     0     1     1
word4     1     1     1

• Word order (primary key) in the index → binary search. Time complexity of one word search in the index: O(V × log2(n)), where n is the index length and V the pattern length; a query of k words costs k such searches.
• The rule 20-30: 20% of the most frequent words make up 30% of the text [MEL, fig. 4.19].

Automatic indexing
The method is based on deriving word significance from word frequencies (cf. the Collins Cobuild dictionary); words with low and high frequency are cut out.
Input: n documents. Output: a list of words suitable for index creation.
1. We calculate the frequency FREQ_ik for every document i ∈ {1, ..., n} and every word k ∈ {1, ..., K} (K is the count of different words in all the documents).
2. We calculate TOTFREQ_k = Σ_{i=1}^{n} FREQ_ik.
3. We create a frequency dictionary for the words k ∈ {1, ..., K}.
4. We set a threshold for the exclusion of very frequent words.
5. We set a threshold for the exclusion of words with a low frequency.
6. We insert the remaining words into the index.
Questions of threshold determination: [MEL, fig. 4.20].

Outline:
• Excursus to computational linguistics.
• Corpus linguistics as a TIS example.
• Search methods with preprocessing of text and pattern (query).

Morphology utilization for dictionary creation
• stem/root of words (učit, uč);
• the program ajka (abin), http://nlp.fi.muni.cz/projekty/ajka/ examples;
• a technique of patterns for stem determination.

Thesaurus
• Thesaurus - a dictionary containing hierarchical and associative relations and relations of equivalence between particular terms.
• Relations between terms/lemmas:
 • synonyms - relation to a standard term, e.g. "see";
 • relation to a related term (RT), e.g. "see also";
 • relation to a broader term (BT);
 • relation to a narrower term (NT);
 • hypernyms (car:means of transport); hyponyms (bird:jay); meronyms (door:lock); holonyms (hand:body); antonyms (good:bad).
• Dog/Fík, Havel/president.

Thesaurus construction: manually / half-automatically
• heuristics of thesaurus construction:
 • hierarchical structure(s) of the thesaurus;
 • field thesauri - the semantics is context-dependent (e.g. field, tree in informatics);
 • compounding of terms with a similar frequency;
 • exclusion of terms with a high frequency;
• breadth of application of a thesaurus and lemmatizer: besides spell checking and indexing, the base of a grammar checker and of fulltext search;
• projects WORDNET, EUROWORDNET;
• module add wordnet; wn faculty -over -simsn -coorn

• Knowledge base creation for exact evaluation of document relevance.
• Hot topic: processing of semantic maps of terms, e.g. Visual Thesaurus http://www.visualthesaurus.com.
• Tovek Tools, Verity.

Part XI: Excursus to the Computational Linguistics

Levels of language analysis in TIS:
• string searching - words are strings of letters;
• word-forming - morphological analysis;
• grammar (CFG, DCG) - syntactic analysis;
• meaning of sentences (TIL) - semantic analysis;
• context - pragmatic analysis;
• full understanding and communication ability - information.

Corpus query examples

Basic queries:
• „Havel";
  45: Český prezident Václav se včera na
  09: jak řekl Václav , každý občan
  248: více než rokem řekl Pravda vítězí

Regular expressions:
• „Pravda|pravda";
• „(P|p)ravda";
• „(P|p)ravd[a,u,o,y]";
• „pravd.*"; „pravd.+"; „post?el";

Word sequences:
• „prezident(a|u)" „Havl(a|ovi)";
• „a tak";
• „prezident"; []* „Havel";
• „prezident" („republiky" „Václav")? „Havel";

Queries for positional attributes:
• [word = „Havel"];
• [lemma = „prezident"] []* [lemma = „Havel"];
• ... ženu prezidenta Havla ... [lemma = „hnát"] [] [lemma = „Havel"];
• [word = „žen(u|eme)" & lemma != „žena"]; (| ... or, ! ... not)

Some other possibilities:
• [lemma = „prezident"] []* [lemma = „Havel"] within 6;
• [lemma = „Havel"] within 20 „Pravda";
• a:[word = „Žena|Muž|Člověk"] []* [lemma = a.lemma]

Motivation
The large computational power of today's computers enables:
• efficient storing of large amounts of text data (compression, indexing);
• efficient search for text strings.
A man sitting behind a computer uses all this to obtain, from the documents processed this way, the information he is interested in. Really?
Example: a text database stores a few last years of a daily newspaper. I'd like to obtain information about president Václav Havel.
a/ >HAVEL
b/ more precise queries
Computer (computational power) + human (intelligence) + time + money → valuable information. The goal of everybody is to transfer the largest possible part of the intelligence (time, money) to the computer.
(Table: levels of language analysis vs. what TIS can do today)
string searching - words are strings of letters - yes - simple check, simple translation
morphological analysis - word-forming, lemma - yes - spell check, partial translation
syntactic analysis - grammar (CFG, DCG) - partially - grammar check
semantic analysis - sentence meaning (TIL) - starting up - correct translation
pragmatic analysis - context - no - searching for information
information - the ideal of ideals - no

Information from natural language
Do we really know what information is contained in a text in natural language?
• František Novák weighs 100 kg - object, property, value.
• František Novák likes beer. František Novák likes Jana Novotná - relations.
• F. N. is an old honest man. Spring erupted in full force.
Words of the natural language denote objects, their properties and relations between them. It is possible to see words and sentences also as "functions" of a kind, defined by their meaning.
• A man who climbed the highest Czech eight-thousander is my grandson. → RDB attribute1, attribute2, ...? key value?

Corpora
• Corpus: an electronic collection of texts, often indexed by linguistic tags.
• Corpus as a text information system: corpus linguistics.
• BNC, Penn Treebank, DESAM, PNK, ...; ranges from millions to billions of positions (words); special methods are necessary.
• Corpus managers CQP, GCQP, Manatee/Bonito, http://www.fi.muni.cz/~pary/ see [MAR].

Definition: a corpus is a large, internally structured compact file of texts in natural language, electronically stored and processable.
• Indian languages have no script - for finding a grammar it is necessary to write up the spoken word.
• 1967 - the first corpus in the U.S.A. (Kučera, Francis), 1,000,000 words.
• Noam Chomsky - refuses corpora.
• Today - massive expansion.
• The WWW page of Pavel Rychlý (~pary) links to basic information. Bonito, Manatee.
• IMS CORPUS WORKBENCH - a toolkit for efficient representation and querying over large text files.

Corpus tagging
A corpus is a sequence of words at numbered positions (first word, nth word), to which tags are added (the addition of tags is called corpus tagging). Tags are morphological, grammatical and any other information about a given word. This leads to the more general concept of position attributes; those are the most important tagging type. Attributes of this class have a value (string) at every corpus position. The basic positional attribute, linking one word to every position, is word. In addition to this attribute, further position attributes may be bundled with each position of any text, representing the morphological and other tags.
Structural attributes: sentences, paragraphs, title, article, SGML.

Example (positions 0-4):
WORD:  Českého prezidenta Václava Havla dnes
LEMMA: český prezident václav havel dnes

Internal representation of position attributes: the two key terms are
• Uniform representation: the items of all attributes are encoded as integer numbers, where the same values have the same code. A sequence of items is then represented as a sequence of integers. The internal representation of the attribute word (as well as of any other positional attribute) is array(0..p−1) of integer, where p is the position count of the corpus.
• Inverted file: for the sequence of numbers representing the sequence of values of a given attribute, an inverted file is created. This file contains, for every value (more precisely, value code), the set of its occurrences in the position attribute.
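A minimal sketch of the uniform representation and the inverted file described above; the function and variable names are invented for illustration (a real corpus manager stores these structures in packed binary files).

```python
# Toy sketch: lexicon (item -> integer code), corpus as a sequence of
# codes, and an inverted file mapping each code to its positions.

def encode_corpus(words):
    lexicon = {}                    # item -> code; the first item gets 0
    codes = []
    for w in words:
        if w not in lexicon:
            lexicon[w] = len(lexicon)
        codes.append(lexicon[w])
    inverted = {}                   # code -> sorted list of corpus positions
    for pos, c in enumerate(codes):
        inverted.setdefault(c, []).append(pos)
    return lexicon, codes, inverted
```

The frequency of an item is then simply the size of its occurrence set, len(inverted[code]), obtained "in one step".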
The inverted file is needed for searching, because it directly gives the set of occurrences of a given item; the occurrences can then be counted in one step.

Auxiliary files
The file with encoded attribute values and the inverted file both have auxiliary files.
• The first data structure is a list of items, or "lexicon": it contains the set of different values. Internally it is a set of strings occurring in the sequence of items, where the symbol Null (octal 000) is inserted behind every word. The list of items already defines a code for every item, because we suppose the first item in the list to have the code 0, the following 1, etc.

There are three data structures for the inverted file:
• The first is the inverted file itself, which contains the sets of corpus positions.
• The second is an index of this file. This index returns, for every item code, the entry point of the corresponding occurrence set in the inverted file.
• The third is a table of item code frequencies, which gives for each item code the number of its occurrences in the corpus (which is of course the same as the size of its occurrence set).

Preprocessing
Preprocessing of text and pattern (query) is used by the overwhelming majority of today's TIS. Types of preprocessing:
• n-gram statistics (fragment indexes);
• special algorithms for index processing (coding, compression) and relevance evaluation (PageRank, Google);
• usage of natural language processing methods (morphology, syntactic analysis, semantic databases) and aggregation of information from multiple sources (the systems AnswerBus, START);
• signature methods.

Relevance
Definition: relevance (of answers to a query) is the rate by which a selected document coincides with the requirements imposed on it.
Ideal answer = real answer.
Definition: the coefficient of completeness (recall) R = m/n, where m is the count of selected relevant records and n is the count of all relevant records in the TIS.
Definition: the coefficient of precision P = m/o, where o is the count of all records selected by the query.
We want to achieve maximum P and R: a tradeoff. Standard values: 80% for P, 20% for R. Combination of completeness and precision: the coefficient F_β = (β² + 1)PR / (β²P + R) (F_0 = P, F_∞ = R; F_1 means P and R weighted equally).

Fragment indexes
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: a fixed dictionary, no problems with updates.
• Disadvantages: dependency on the language and thematic area, decreased precision of search.

Google
• Google as an example of a web-scale information system.
• Jeff Dean's video - historical notes on Google search developments.
• Google - system architecture.
• Google - PageRank.
• Google File System.
• Implementation of index systems.

Goooooooooooooogle - a bit of history
An example of the anatomy of a global (hyper)text information system (www.google.com).
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are known - therefore a more detailed analysis according to the article [GOO] http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.

• Several innovative concepts: PageRank, storing of a local compressed archive, calculation of relevance from the texts of hypertext links, indexing of PDF and other formats, Google File System, Google Link...
• The system anatomy: see [MAR].

Relevance computation
The crucial thing is the computation of documents' relevance (credit).
• Usage of text tags and web typography for the relevance calculation of document terms.
• Usage of the text of hyperlinks referring to the document.

PageRank
• PageRank: an objective measure of page importance based on citation analysis (suitable for the ordering of answers to queries, namely page relevance computation).
• Let pages T1, ..., Tn (citations) point to a page A, let d be a damping factor and C(T) the count of links going out of a page T. Then

  PR(A) = (1 − d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

• PageRank can be calculated by a simple iterative algorithm (for tens of millions of pages in hours on a normal PC).
• PageRank is a probability distribution over web pages.
• PageRank is not the only applied factor, but one coefficient of several factors. Motivation with a random surfer; the damping factor d is usually around 0.85.

Implementation of index systems
• Storing of file signatures.
• Storing of the lexicon.
• Storing of the hit list.
• Google File System.

• Inverted file - an indexing file with a bit vector.
• Usage of a document list for every key word.
• Coordinate system with pointers [MEL, fig. 4.12, page 46].
• Indexing of corpus texts: Finlib http://www.fi.muni.cz/~pary/dis.pdf, see [MAR].
• Use of Elias coding for the compression of the hit list.
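The iterative PageRank computation described above can be sketched as follows. This is an illustrative toy version, not Google's implementation; the graph encoding (page → list of outgoing links) and the names are our own assumptions, with d = 0.85.

```python
# Toy PageRank iteration: PR(A) = (1 - d) + d * sum PR(T)/C(T)
# over the pages T that link to A.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}            # initial ranks
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum PR(T)/C(T) over pages T pointing to p
            s = sum(pr[t] / len(links[t]) for t in pages if p in links[t])
            new[p] = (1 - d) + d * s
        pr = new
    return pr
```

For example, in a three-page graph where both b and c link to a and only a links back to b, the iteration ranks a highest and the unlinked page c lowest.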
is" Commercial implementations: Verity engine, most of web search engines - with few exceptions - hide their key to success. Petr Sojka PV030 Textual Information Systems Index System Implementation Article M. Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing see [MAP] is" Dictionary representation by finite automaton. is" Ambiguities, unification of minimized deterministic automata. is" Example: done,do.V3:PP done,done.AO is" Morphological dictionary as a list of pairs [word form, lemma]. is" Compaction of storing of data structure of automata (Liang, -1983). is" Compression ratio up to -1:20 in the linear approach (given the length of word). Petr Sojka PV030 Textual Information Systems Index System Implementation is" Transducer for dictionary representation. is" Deterministic transducer with -1 output (subsequential transducer) for dictionary representation including one string on output [Information about morphology, hyphenation,...). is" Deterministic transducer with p outputs (p—subsequential transducer) for dictionary representation including more strings on output (ambiguities). is" Determinization of the transducer generally unrealizable (the class of deterministic transducers with an output is a proper subclass of nondeterministic transducers); for purposes of natural language processing, though, usually doesn't occur (there aren't cycles). Petr Sojka PV030 Textual Information Systems Index System Implementation is" An addition of a state to a transducer corresponding (wi.w^) without breaking the deterministic property: first a state for [w-],e), then with resulting state final state with output W2- is" Efficient method, quick, however not minimal; there are minimizing algorithms, that lead to spatially economical solutions. is" Procedure: splitting of dictionary, creation of det. transducers with p outputs, their minimization, then a deterministic unification of transducers and minimizing the resulting. 
• An addition of a state corresponding to (w1, w2) to a transducer without breaking the deterministic property: first a state for (w1, ε), then the resulting final state with the output w2.
• An efficient and quick method, however not a minimal one; there are minimization algorithms that lead to spatially economical solutions.
• Procedure: splitting of the dictionary, creation of deterministic transducers with p outputs, their minimization, then a deterministic unification of the transducers and minimization of the result.
• Other uses: efficient indexing, speech recognition, etc.

Part XII: Coding

• Coding.
• Entropy, redundancy.
• Universal coding of the integers.
• Huffman coding. Adaptive Huffman coding.

Basic notions
Definition: an alphabet A is a finite nonempty set of symbols.
Definition: a word (string, message) over A is a sequence of symbols from A.
Definition: the empty string ε is the empty sequence of symbols. The set of nonempty words over A is labeled A+.
Definition: a code K is a triple (S, C, f), where S is a finite set of source units, C is a finite set of code units, and f: S → C+ is an injective mapping. f can be expanded to f: S+ → C+: f(s1 s2 ... sk) = f(s1) f(s2) ... f(sk). The image of f in C+ is sometimes itself called the code.

Definition: x ∈ C+ is uniquely decodable with regard to f if there is at most one sequence y ∈ S+ such that f(y) = x.
Definition: a code K = (S, C, f) is uniquely decodable if all the strings in C+ are uniquely decodable.
Definition: a code is called a prefix one if no code word is a prefix of another.
Definition: a code is called a suffix one if no code word is a suffix of another.
Definition: a code is called an affix one if it is a prefix and a suffix code.
Definition: a code is called a full one if, after adding any additional code word, a code arises that is not uniquely decodable.

Definition: a block code of length n is a code in which all code words have the length n.
Example: block ⇒ prefix, but not vice versa.
Definition: a code K = (S, C, f) is called binary if |C| = 2.

Definition: compression (coding) and decompression (decoding):
original data → compression (encoding) → compressed data → decompression (decoding) → original data
Definition: the compression ratio is the ratio of the length of the compressed data to the length of the original data.
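The prefix property defined above is easy to test mechanically. The following is an illustrative sketch (our own, not from the slides); it relies on the fact that if one code word is a prefix of another, the two end up adjacent after sorting.

```python
# Check the prefix property: no code word is a prefix of another.

def is_prefix_code(words):
    ws = sorted(words)            # a prefix always sorts immediately before
                                  # some word it is a prefix of
    return all(not ws[i + 1].startswith(ws[i]) for i in range(len(ws) - 1))
```

A block code passes this check trivially, which illustrates "block ⇒ prefix".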
Example: suggest a binary prefix code for decimal digits, if the digits 3 and 4 occur often and 5 and 6 rarely.

Entropy
Let Y be a random variable with a probability distribution p(y) = P(Y = y). Then the mathematical expectation (mean rate) is E(Y) = Σ_y y p(y).
Let S = {x1, x2, ..., xn} be a set of source units and let the occurrence probability of the unit xi in the information source S be pi for i = 1, ..., n, n ∈ N.
Definition: the entropy of the information content of a unit xi (the measure of the amount of information, or uncertainty) is H(xi) = Hi = −log2 pi bits. A source unit with a higher probability bears less information.

Definition: the entropy of an information source S is H(S) = −Σ pi log2 pi bits. It holds that H(S) = Σ p(y) log2 (1/p(y)) = E(log2 (1/p(Y))).
Definition: the entropy of a source message X = x_{i1} x_{i2} ... x_{ik} ∈ S+ of the information source S is H(X, S) = H(X) = Σ_{j=1}^{k} H_{ij} = −Σ_{j=1}^{k} log2 p_{ij} bits.
Definition: the length l(X) of an encoded message X is l(X) = Σ_{j=1}^{k} |f(x_{ij})| = Σ_{j=1}^{k} d_{ij} bits.
Theorem: l(X) ≥ H(X, S).

Axiomatic introduction of entropy: see [MAR]; for details of the derivation see ftp://www.math.muni.cz/pub/math/people/Paseka/lectures/kodovani.ps
Definition: R(K, X) = l(X) − H(X) = Σ_j (d_{ij} + log2 p_{ij}) is the redundancy of a code K for a message X.
Definition: the average length of a code word of K is AL(K) = Σ_{i=1}^{n} pi di bits.
Definition: the average entropy of a source S is AE(S) = Σ_{i=1}^{n} pi Hi = −Σ_{i=1}^{n} pi log2 pi bits.
Definition: the average redundancy of a code K is AR(K) = AL(K) − AE(S) = Σ_{i=1}^{n} pi (di + log2 pi) bits.

Definition: a code is an optimal one if it has minimal redundancy.
Definition: a code is asymptotically optimal if, for a given distribution of probabilities, the ratio AL(K)/AE(S) is close to 1 while the entropy is close to ∞.
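The quantities AE(S), AL(K) and AR(K) defined above can be sketched directly; the probabilities and code word lengths below are invented examples.

```python
# Sketch: average entropy, average code word length, average redundancy
# for a source with probabilities p_i and code word lengths d_i.

import math

def avg_entropy(p):
    """AE(S) = -sum p_i log2 p_i (bits)."""
    return -sum(pi * math.log2(pi) for pi in p)

def avg_length(p, d):
    """AL(K) = sum p_i d_i (bits)."""
    return sum(pi * di for pi, di in zip(p, d))

def avg_redundancy(p, d):
    """AR(K) = AL(K) - AE(S) = sum p_i (d_i + log2 p_i) (bits)."""
    return avg_length(p, d) - avg_entropy(p)
```

For p = (1/2, 1/4, 1/4) and lengths (1, 2, 2), both AE and AL equal 1.5 bits, so the redundancy is 0: the code is optimal for this source.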
Definition: A code K is universal if there are c1, c2 ∈ R such that the average code-word length satisfies AL(K) ≤ c1 · AE(S) + c2.
Theorem: A universal code is asymptotically optimal if c1 = 1.

Definition: The Fibonacci sequence of order m: F_n = F_{n−m} + F_{n−m+1} + ... + F_{n−1} for n ≥ 1.
Example: F of order 2: F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 3, F_4 = 5, F_5 = 8, ...
Example: F of order 3: F_{−2} = 0, F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 7, F_5 = 13, ...
Example: F of order 4: F_{−3} = 0, F_{−2} = 0, F_{−1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 8, F_5 = 15, ...
Definition: The Fibonacci representation is R(N) = Σ_{i=1}^k d_i F_i, where d_i ∈ {0, 1} and d_k = 1.
Theorem: The Fibonacci representation is ambiguous; however, there is one with at most m − 1 consecutive ones in the sequence d_i.
Definition: The Fibonacci code of order m is FK_m(N) = d_1 d_2 ... d_k 1...1 (m − 1 ones appended), where d_i are the coefficients from the previous theorem (the ones terminate a code word).
Example: R(32) = 0·1 + 0·2 + 1·3 + 0·5 + 1·8 + 0·13 + 1·21, thus FK_2(32) = 00101011.
Theorem: FK(2) is a prefix, universal code with c1 = 2, c2 = 3; thus it is not asymptotically optimal.

Elias codes of the integers:
• unary code α(N) = 00...01 (N − 1 zeros).
• binary code β(1) = 1, β(2N + j) = β(N)j, j = 0, 1.
• β is not uniquely decodable (it is not a prefix code).
• ternary τ(N) = β(N)#.
• β′(1) = ε, β′(2N) = β′(N)0, β′(2N + 1) = β′(N)1; τ′(N) = β′(N)#.
• γ: every bit of β′(N) is inserted between a pair of bits of α(|β(N)|).
• Example: γ(6) = 01001.
• C_γ = {γ(N) : N > 0} = (0{0,1})*1 is regular and therefore decodable by a finite automaton.
• γ′(N) = α(|β(N)|)β′(N): the same length (a bit permutation of γ(N)), but more readable.
• C_γ′ = {γ′(N) : N > 0} = {0^k 1 {0,1}^k : k ≥ 0} is not regular, and the decoder needs a counter.
• δ(N) = γ(|β(N)|)β′(N).
• Example: δ(4) = γ(3)00 = 01100.
• decoder δ″: while N > 1 do begin K := β(N)K; N := ⌊log2(N)⌋ end.
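The γ′(N) = α(|β(N)|)β′(N) variant above is the usual form of the Elias gamma code; since α(k) = 0^{k−1}1 and β(N) = 1β′(N), the code word is simply len(β(N))−1 zeros followed by β(N). A sketch of encoder and decoder (assuming N ≥ 1; names are mine):

```python
def gamma_encode(n):
    """Elias gamma: alpha(|beta(n)|) followed by beta'(n)."""
    b = bin(n)[2:]                 # beta(n)
    return "0" * (len(b) - 1) + b  # unary length prefix, then beta(n)

def gamma_decode(bits):
    """Decode a concatenation of gamma code words back to a list of integers."""
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "0":      # count the unary prefix
            z += 1
            i += 1
        out.append(int(bits[i:i + z + 1], 2))  # read z + 1 bits of beta(n)
        i += z + 1
    return out

print(gamma_encode(6))                                   # 00110
print(gamma_decode(gamma_encode(1) + gamma_encode(6)))   # [1, 6]
```

The counter z in the decoder is exactly the counter the slide mentions: the language of γ′ code words is not regular, so a finite automaton alone cannot decode it.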
• Information encoding for communication purposes.
• Despite the tumultuous evolution of storage capacities, there is still a lack of space, and access to compressed data saves time.
• Redundancy → construction of a code with minimal redundancy.
• Data model:
  • structure - the set of units to compress + the context of their occurrences;
  • parameters - occurrence probabilities of particular units;
  • data model creation;
  • the actual encoding.

• 1838 Morse: the code for 'e' chosen by frequency.
• 1949 Shannon, Fano, Weaver.
• 1952 Huffman; 5 bits per character.
• 1979 Ziv-Lempel; compress (Rodeh, Welch, Bell, Knuth, Miller, Wegman, Fiala, Greene, ...); 4 bits per character.

• run-length encoding (RLE): SC X C - a run of a repeating character X of length C is replaced by a special character SC, the character X, and the count C; generalization to any repeating character: ****** → SC*6.
• MNP Class 5 RLE - C XXX: DDDDDBBAAAA → 5DDDBB4AAA.
• half-byte packing (EBCDIC, ASCII), SI, SO.
• diatomic encoding: replacement of character pairs with one character.
• Byte Pair Encoding, BPE (Gage, 1994).
• pattern substitution.
• Gilbert Held: Data & Image Compression.

• Shannon-Fano coding, 1949, model of order 0.
• code words of length ⌈−log2 p_i⌉ or ⌊−log2 p_i + 1⌋.
• AE ≤ AL < AE + 1.
• code tree for (2, 2, 2, 2, 4, 4, 8).
• generally it is not optimal; two passes of the encoder through the text; static.

Input: a sequence of n source units S[i], 1 ≤ i ≤ n, in order of nondecreasing probabilities.
Output: n binary code words.
begin
  assign the empty string to all code words;
  SF-SPLIT(S)
end

procedure SF-SPLIT(S);
begin
  if |S| ≥ 2 then
  begin
    divide S into sequences S1 and S2 so that both sequences have roughly the same total probability;
    append 0 to all code words of S1;
    append 1 to all code words of S2;
    SF-SPLIT(S1); SF-SPLIT(S2)
  end
end

• Huffman coding, 1952.
• static and dynamic variants.
• an optimal code (not the only possible one).
• O(n), assuming the source units are already sorted.
• stable distribution → the code can be prepared in advance.
• Example: (2, 2, 2, 2, 4, 4, 8).

Definition: A binary tree has the sibling property if and only if
(1) each node except the root has a sibling,
(2) the nodes can be arranged in order of nondecreasing weight so that each node (except the root) is adjacent in this list to its sibling (the left sons are on the odd positions of the list and the right ones on the even positions).

Huffman trees
Theorem: A binary prefix code is a Huffman code if and only if it has the sibling property.
• 2n − 1 nodes, max. 2n − 1 possibilities.
• there are optimal binary prefix codes that are not Huffman codes.
• AR(K) ≤ p_n + 0.086, where p_n is the maximum probability of a source unit.
• A Huffman code is full (hence poor error detection).
• It can be extended to an affix code; KWIC, left and right context, searching for *X*.

Adaptive Huffman coding
• FGK (Faller, Gallager, Knuth).
• suppression of the past by a coefficient of forgetting; rounding.
• linear time of coding and decoding with respect to the word length.
• AL_HD ≤ 2 AL_HS.
• Vitter: AL_HD ≤ AL_HS + 1.
• implementation details: tree representation, code tables.
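A minimal static Huffman construction over the example frequencies (2, 2, 2, 2, 4, 4, 8) can be sketched with a heap (O(n log n) here; O(n) holds only when the units come presorted, as the slide notes). Names are mine:

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: codeword} for a dict of symbol frequencies."""
    tick = count()  # tie-breaker so the heap never compares subtrees
    heap = [(f, next(tick), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # two lightest subtrees
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (a, b)))
    codes = {}
    def walk(node, prefix):
        # leaves receive their accumulated 0/1 path as the code word
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = dict(zip("abcdefg", [2, 2, 2, 2, 4, 4, 8]))
codes = huffman(freqs)
cost = sum(freqs[s] * len(c) for s, c in codes.items())
print(cost)   # 64 bits for the whole message of 24 units, i.e. AL = 64/24
```

Several distinct optimal trees exist for these frequencies (the code is "not the only possible one"), but every one of them yields the same total cost of 64 bits.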
Arithmetic coding
• a generalization of Huffman coding (the probabilities of source units need not be negative powers of two).
• ordering of source units; cumulative probabilities cp_i = Σ_{j=1}^{i−1} p_j for source units x_i with probabilities p_i.
• Advantages:
  • arbitrary proximity to the entropy;
  • adaptivity is possible;
  • speed.

Dictionary methods of compression
Definition: A dictionary is a pair D = (M, C), where M is a finite set of words of the source language and C is a mapping of M to the set of code words.
Definition: L(m) denotes the length of the code word C(m) in bits, for m ∈ M.
Selection of source units:
• static (agreement on the dictionary in advance),
• semiadaptive (two passes through the text are necessary),
• adaptive.

Static dictionary methods
Source units of length n: n-grams; most often bigrams (n = 2).
• n fixed, or n variable (by frequency of occurrence),
• adaptive (about 50% of an English text consists of roughly 150 most frequent words).
Disadvantages:
• they are unable to react to the probability distribution of the compressed data,
• a pre-prepared dictionary is needed.

Semiadaptive dictionary methods
Transmitted: dictionary + compressed data, or compressed dictionary + compressed data.
Advantage: extensive data (the dictionary is a small part of the data; corpora, CQP).

Semiadaptive dictionary methods - dictionary creation procedure
1. The frequencies of N-grams are determined for N = 1, 2, ....
2. The dictionary is initialized by inserting the unigrams.
3. N-grams with the highest frequency are gradually added to the dictionary. During the insertion of a K-gram, the frequencies of its component (K−1)-grams, (K−2)-grams, ... are decreased. If the frequency of a component is thereby greatly reduced, it is excluded from the dictionary.
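Step 1 of the dictionary-creation procedure, counting the N-gram frequencies, can be sketched as follows (the function name is mine):

```python
from collections import Counter

def ngram_frequencies(text, max_n):
    """Count all N-grams of the text for N = 1 .. max_n."""
    freq = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            freq[text[i:i + n]] += 1
    return freq

freq = ngram_frequencies("abababa", 2)
print(freq["a"], freq["b"], freq["ab"], freq["ba"])  # 4 3 3 3
```

Steps 2 and 3 would then insert the unigrams and repeatedly pop the most frequent longer N-gram from `freq`, decreasing the counts of its components.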
Next topics:
• Adaptive dictionary methods with dictionary restructuring.
• Syntactic methods.
• Checking of text correctness.
• Querying and TIS models.
• Vector model of documents.
• Automatic text structuring.
• Document similarity.

Adaptive dictionary methods: LZ77 - sliding-window methods; LZ78 - methods with an increasing dictionary.

LZ77
a b c b a b b a | a b a c b
encoded part (window, N ≤ 8192) | not encoded part (|B| ~ 10-20 b)
In the encoded part the longest prefix P of the not encoded part is searched for. If such a string S is found, then P is encoded as (I, J, A), where I is the distance of the first character of S from the border, J is the length of the string S, and A is the first character after the prefix P. The window is then shifted J + 1 characters to the right. If no such substring is found, the triple (0, 0, A) is output, where A is the first character of the not encoded part.

|M| = (N − B) × B × t, where t is the size of the alphabet; L(m) = ⌈log2(N − B)⌉ + ⌈log2 B⌉ + ⌈log2 t⌉.
Advantage: the search for the longest prefix [KMP].

LZR
• uses a tree containing all prefixes of the already encoded part.
• The whole already encoded part is used as the dictionary.
• Because I in (i, j, a) can be large, the Elias code for the integers is used.
Disadvantage: the tree grows without limitation ⇒ after a defined memory size is exceeded, it is deleted and the construction starts from scratch.

LZSS
The code is a sequence of pointers and characters. A pointer (i, j) needs the memory of p characters, so a pointer is used only when it pays off; one extra bit is needed to distinguish a character from a pointer. The number of dictionary items is |M| = t + (N − B) × (B − p) (considering only substrings longer than p).
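The LZ77 triple scheme described above can be sketched as follows (window bookkeeping simplified: the whole already-encoded prefix up to `window` characters back serves as the dictionary; names are mine):

```python
def lz77_encode(text, window=8192):
    """Encode text as triples (i, j, a): distance back, match length, next char."""
    out, pos = [], 0
    while pos < len(text):
        best_i, best_j = 0, 0
        start = max(0, pos - window)
        for i in range(start, pos):             # candidate match starts
            j = 0
            while pos + j < len(text) - 1 and text[i + j] == text[pos + j]:
                j += 1                          # matches may overlap the border
            if j > best_j:
                best_i, best_j = pos - i, j
        out.append((best_i, best_j, text[pos + best_j]))
        pos += best_j + 1                       # shift the window by J + 1
    return out

def lz77_decode(triples):
    s = ""
    for i, j, a in triples:
        for _ in range(j):
            s += s[-i]                          # copy from i characters back
        s += a
    return s

msg = "abcbabbaabacb"
print(lz77_decode(lz77_encode(msg)) == msg)     # True
```

Copying character by character in the decoder is what makes overlapping matches (a match that runs past the border into itself) work, as in a run of a single repeated character.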
The number of bits needed for encoding is
• L(m) = 1 + ⌈log2 t⌉ for m ∈ T,
• L(m) = 1 + ⌈log2 N⌉ + ⌈log2(B − p)⌉ otherwise.
(The length d of a substring can be represented as d − p.)

LZB
A pointer (i, j) (analogy to LZSS). If
• the window is not full (at the beginning) and
• the compressed text is shorter than N,
then using log2 N bits for encoding i is wasteful. LZB uses phased-in binary coding - a prefix code with an increasing number of bits for increasing values; Elias code γ.

A further variant: LZSS where the pointers are encoded by Huffman coding (i.e., according to the distribution of their probabilities ⇒ two passes).

LZ78
The main idea: the dictionary contains phrases. A new phrase is created by extending an already existing phrase by one symbol. A phrase is encoded by the index of its prefix and the added symbol.

Input:  a     b     ab    c     ba    bab   aa    aaa   aaaa
Index:  1     2     3     4     5     6     7     8     9
Output: (0,a) (0,b) (1,b) (0,c) (2,a) (5,b) (1,a) (7,a) (8,a)

The dictionary is stored in a tree structure; the edges are labeled with strings of characters. These strings are in the window, and each node of the tree contains a pointer into the window identifying the symbols on the path from the root to that node.

LZW (Welch), LZC
Only the indexes are output:
• the dictionary is initialized with items for all input symbols,
• the last symbol of each phrase is the first symbol of the following phrase.
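A minimal LZW sketch in the spirit of the description above (the dictionary preinitialized with the single symbols; indexes start at 1 to match the slides; names are mine):

```python
def lzw_encode(text, alphabet):
    """LZW: output dictionary indexes only."""
    dic = {ch: i + 1 for i, ch in enumerate(alphabet)}
    out, phrase = [], ""
    for ch in text:
        if phrase + ch in dic:
            phrase += ch                        # extend the current phrase
        else:
            out.append(dic[phrase])
            dic[phrase + ch] = len(dic) + 1     # new phrase = old phrase + next symbol
            phrase = ch                         # next phrase starts with this symbol
    out.append(dic[phrase])
    return out

print(lzw_encode("ababcbababaaaaaa", "abc"))
# [1, 2, 4, 3, 5, 8, 1, 10, 11]
```

Note how the overlap rule shows up: each new dictionary entry ends with the symbol that begins the following phrase, so the decoder can rebuild the dictionary from the indexes alone.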
Input:  a  b  ab  c  ba  bab  a  aa  aaa
Index:  4  5  6   7  8   9    10 11
Output: 1  2  4   3  5   8    1  10  11

LZC: dictionary overflow ⇒ the next phrase is not inserted and the coding continues statically. It is LZW plus:
• pointers are encoded with increasing length,
• once the compression ratio decreases, the dictionary is deleted and built again from scratch.

LZT
As LZC, but when the dictionary overflows, the phrases least used in the recent past are excluded from the dictionary. It uses phased-in binary coding of the phrase indexes.

LZMW
As LZT, but a new phrase is not created by adding one character to the previous phrase; instead, the new phrase is constructed by concatenating the last two encoded ones.

LZJ
Another principle of dictionary construction:
• at the beginning, only the single symbols are inserted;
• the dictionary is stored in a tree and contains all substrings of the processed string of length up to h;
• full dictionary ⇒
  • static procedure, or
  • omitting of nodes with low usage frequency.

Dictionary restructuring
• Ongoing reorganization of source units → shorter code strings.
• Variants of heuristics: count of occurrences, move to the front (BSTW), exchange with the previous, move X positions forward.
• BSTW (advantage: high locality of occurrences of a small number of source units).
• Example: I'm not going to the forest, ..., a_1^n a_2^n ... a_k^n.
• Generalization: recency coefficient, interval coding.

Interval coding
A word is represented by the number of words since its last occurrence. The dictionary contains the words a_1, a_2, ..., a_n; the input sequence is x_1, x_2, ..., x_m. The value LAST(a_i), the position of the last occurrence, is initialized to zero.
for t := 1 to m do
begin {x_t = a_i}
  if LAST(x_t) = 0 then y_t := t + i − 1
                   else y_t := t − LAST(x_t);
  LAST(x_t) := t
end
The sequence y_1, y_2, ..., y_m (with y_t = t + i − 1 at the first occurrence of x_t = a_i, and y_t = t − LAST(x_t) otherwise) is the output of the encoder and can be encoded by one code of variable length.

Syntactic methods
• the grammar of the message language is known,
• the left parse of the derivation tree of the string,
• global numbering of the rules,
• local numbering of the rules,
• the decision states of an LR analyzer are encoded.

Context modeling
• fixed context - a model of order N.
• combined approach - contexts of various lengths, p(x) = Σ_{n=0}^{m} w_n p_n(x).
• w_n fixed or variable.
• time and memory consuming.
• assignment of a probability to a new source unit.
• finite-context automata,
• dynamic Markov modeling.

Checking the correctness of text
• Checking of the text using a frequency dictionary.
• Checking of the text using a double frequency dictionary.
• Interactive checking of the text (ispell).
• Checking of the text based on the regularity of words: the weirdness coefficient.

The weirdness coefficient of a trigram xyz is
KPT = [log(f(xy) − 1) + log(f(yz) − 1)]/2 − log(f(xyz) − 1),
where f(xy) and f(xyz) are the relative frequencies of the bigram and the trigram, respectively, and log(0) is defined as −10.
The weirdness coefficient of a word is
KPS = √(Σ_i (KPT_i − SKPT)²),
where KPT_i is the weirdness coefficient of the i-th trigram and SKPT is the mean of the weirdness coefficients of all trigrams contained in the word.

• Querying and TIS models.
• Boolean model of documents.
• Vector model of documents.
• TIS architecture.
• Signature methods.
• Similarity of documents.
• Vector model of documents (completion).
• Extended Boolean model.
• Probability model.
• Model of document clusters.
• TIS architecture.
• Automatic text structuring.
• Document similarity.
• Lexicon storage.
• Signature methods.

Different methods of hierarchization and document storage ⇒ different possibilities and efficiency of querying.
• Boolean model, SQL.
• Vector model.
• Extended Boolean types.
• Probability model.
• Model of document clusters.

The search lies in reducing the uncertainty of the question.
1. We find a document with high relevance.
2. We start to query with its key words.
3. We remove descriptors, or replace them with disjunctions.

Boolean model: querying
The system http://infomap.stanford.edu works with the searched meaning/concept (as opposed to mere strings of characters). The right query formulation is half of the answer. The search lies in determining the semantically closest terms.

• Fifties: representation of documents using sets of terms and querying based on the evaluation of Boolean expressions.
• Query expressions: built inductively from the primitives:
  • term,
  • attribute_name = attribute_value (comparison),
  • function_name(term) (application of a function),
and also using parentheses and the logical connectives X and Y, X or Y, X xor Y, not Y.
• disjunctive normal form, conjunctive normal form
• proximity operators
• regular expressions
• thesaurus usage

Boolean model
• Boolean operators: and, or, xor, not.
• positional operators: adj, (n) words, with, same, syn.
• SQL extension: operations/queries using a thesaurus:
BT(A) broader term
NT(A) narrower term
PT(A) preferred term
SYN(A) synonyms of the term A
RT(A) related term
TT(A) top term

ORACLE SQL*TEXTRETRIEVAL
SELECT specification_of_items
FROM specification_of_tables
WHERE item CONTAINS text_expression
Example:
SELECT TITLE FROM BOOK
WHERE ABSTRACT CONTAINS 'TEXT' AND RT(RETRIEVAL)
Patterns: 'string', 'string'*, *'string', 'st?ing', 'str%ing', 'stringa' (m,n) 'stringb', 'multiword phrases', BT('string',n), BT('string',*), NT('string',n).
Example:
SELECT NAME FROM EMPLOYEE
WHERE EDUCATION CONTAINS RT(UNIVERSITA)
AND LANGUAGES CONTAINS 'ENGLISH' AND 'GERMAN'
AND PUBLICATIONS CONTAINS 'BOOK' OR NT('BOOK',*)

Boolean model: association factor
asoc(Q_A, Q_B) = log10 [ (fN − AB − N/2)² N / (AB(N − A)(N − B)) ],
where
A - the number of documents "hit" by the query Q_A,
B - the number of documents "hit" by the query Q_B (whose relevance we compute),
f - the number of documents "hit" by both queries,
N - the total number of documents in the TIS.
cutoff (relevant/irrelevant), clustering/nesting, 1st generation, 2nd generation, ...

Vector model of documents: let a_1, ..., a_n be terms, D_1, ..., D_m documents, and W = (w_ij) a relevance matrix of type m, n, with w_ij ∈ ⟨0, 1⟩: 0 irrelevant, 1 relevant.
A query Q = (q_1, ..., q_n); S(Q, D_i) = Σ_j q_j w_ij is the similarity coefficient; head(sort(S(Q, D_i))) is the answer.
CONS: does not take "and"/"or" in the query into account.
PROS: possible improvements:
• normalization of weights,
• term frequency TF,
• inverted document frequency IDF = log2(m/k) for a term occurring in k of the m documents,
• distinction of terms,
• normalization of weights for the document and for the query [POK].

Automatic text structuring
• Interrelations between documents in a TIS.
• Encyclopedia (OSN, Funk and Wagnalls New Encyclopedia).
• [SBA] http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs
• Google/CiteSeer: "automatic structuring of text files".

Document similarity
• Most often the cosine measure - advantages.
• For a detailed overview of similarity functions see chapter 5.7 of [KOR] (similarity).

[MeM] Mohri, M.: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61-80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
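The cosine measure mentioned above, applied to rows of the vector model's relevance matrix, can be sketched as follows (names are mine):

```python
from math import sqrt

def cosine(q, d):
    """Cosine similarity of a query vector q and a document vector d."""
    dot = sum(x * y for x, y in zip(q, d))
    norm = sqrt(sum(x * x for x in q)) * sqrt(sum(y * y for y in d))
    return dot / norm if norm else 0.0

q  = [1, 1, 0, 0]          # query uses terms 1 and 2
d1 = [1, 1, 1, 0]          # document weight rows w_ij
d2 = [0, 0, 1, 1]
print(cosine(q, d1) > cosine(q, d2))   # True: d1 is ranked first
```

One advantage the slide alludes to: dividing by the vector norms makes the measure independent of document length, unlike the raw coefficient S(Q, D_i) = Σ_j q_j w_ij.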