PV030 Textual Information Systems
Petr Sojka, Faculty of Informatics, Masaryk University, Brno. Spring 2012.

Part I: Information about the course PV030 Textual Information Systems.

Basic info:
• Petr Sojka, sojka@fi.muni.cz
• Consulting hours Spring 2012: Wednesday 13:00–13:45, Friday 10:00–11:50, or write an email with other suggestions to meet.
• Room C523/522, fifth floor of block C, Botanická 68a.
• Course homepage: http://www.fi.muni.cz/~sojka/PV030/
• Seminar (Thu 12:00–12:50, C511 → B311).

Prerequisites: the student is expected to have basic knowledge of the theory of automata and formal languages (IB005) and elementary knowledge of the theories of complexity, software programming and systems.

Classification: there is a system of classification based on written mid-term (30 points) and final (70 points) exams. In addition, one can get premium points for activities during the lectures. Classification scale (changes reserved): Z/K/E/D/C/B/A correspond to obtaining 48/54/60/66/72/78/84 points. Dates of [final] exams will be announced via IS.muni.cz (probably three terms).

"My books focus on timeless truth." (D. E. Knuth, Brno, 1996)

An emphasis will be given to the explanation of basic principles, algorithms and (software) design techniques, and to the creation and implementation of textual information systems (TIS) — storage and information retrieval.

Course syllabus:
1. Basic notions. TIS (text information system). Classification of information systems. From texts to Watson.
2. Searching in TIS. Searching and pattern matching: classification and data structures. Algorithms of Knuth-Morris-Pratt, Aho-Corasick, regular expressions.
3. Algorithms of Boyer-Moore, Commentz-Walter, Buczilowski.
4. Theory of automata for searching. Classification of searching problems. Searching with errors.
5. Indexes. Indexing methods. Data structures for searching and indexing.
6. Google as an example of a search and indexing engine. PageRank. Signature methods.
7. Query languages and document models: Boolean, vector, probabilistic, MMM, Paice.
8. Data compression. Basic notions. Entropy.
9. Statistical methods.
10. Compression methods based on dictionary.
11. Syntactic methods. Context modeling. Language modeling. Corpus linguistics.
12. Spell checking. Filtering information channels. Document classification. Neural nets for text compression.

Textbooks:
[MEL] Melichar, B.: Textové informační systémy, skripta ČVUT Praha, 2. vydání, 1996.
[POK] Pokorný, J., Snášel, V., Húsek, D.: Dokumentografické informační systémy, Karolinum Praha, 1998.
[KOR] Korfhage, R. R.: Information Storage and Retrieval, Wiley Computer Publishing, 1997.
[SMY] Smyth, B.: Computing Patterns in Strings, Addison Wesley, 2003.
[KNU] Knuth, D. E.: The Art of Computer Programming, Vol. 3, Sorting and Searching, Second edition, 1998.
[WMB] Witten, I. H., Moffat, A., Bell, T. C.: Managing Gigabytes: Compressing and Indexing Documents and Images, Second edition, Morgan Kaufmann Publishers, 1999.

Further study materials (cont.):
[HEL] Held, G.: Data and Image Compression, Tools and Techniques, John Wiley & Sons, 4. vydání, 1996.
[MEH] Melichar, B., Holub, J.: A 6D Classification of Pattern Matching Problems, Proceedings of the Prague Stringology Club Workshop '97, Prague, July 1997.
[GOO] Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7/Computer Networks 30(1–7): 107–117 (1998). http://dbpubs.stanford.edu:8090/pub/1998-8
[MeM] Mohri, M.: On Some Applications of Finite-State Automata Theory to Natural Language Processing, Natural Language Engineering, 2(1):61–80, 1996. http://www.research.att.com/~mohri/cll.ps.gz
[Sch] Schmidhuber, J.: Sequential neural text compression, IEEE Transactions on Neural Networks 7(1), 142–146, 1996. http://www.idsia.ch/~juergen/onlinepub.html
[SBA] Salton, G., Buckley, Ch., Allan, J.: Automatic structuring of text files, Electronic Publishing.

Basic notions:
• Abstractions and mappings in information systems: data ↔ information need ↔ query.
• Information needs about the reality — queries above the data.
• Jeopardy game: Watson.

Definition: An information system is a system that allows purposeful arrangement of collection, storage, processing and delivering of information.

Definition: The ectosystem consists of the IS users, the investor of the IS, and the entrepreneur (user, funder, server). In the example of is.muni.cz they are the users of the IS, MU represented by the bursar, and the ICS and IS teams. The ectosystem is not under the control of the IS designer.

Definition: The endosystem consists of the hardware used (media, devices) and the software (algorithms, data structures), and is under the control of the IS designer.

Quality measures of an IS:
• effectiveness (user),
• economics (funder),
• efficiency (server),
and the compromises implied by these different preferences. Our view will be the view of a TIS architect respecting the requests of the IS ectosystem. For topics related to the ectosystem of IS see PV045 Management IS.

• Data: a concrete representation of a message in the form of a sequence of symbols of an alphabet.
• Information: a reflection of the known or the expected substance of realities. A piece of information depends on the intended subject. Viewpoints:
  • quantitative (information theory);
  • qualitative (meaning, semantics);
  • pragmatic (valuation: significance, usefulness, usability, periodicity, up-to-dateness, credibility);
  • others (promptness, particularity, completeness, univocality, availability, costs of obtaining).
• Knowledge (znalost).
• Wisdom (moudrost).

Notions of (T)IS; PV030 in the context of teaching at FI MU.

Definition: An information process is a process of formation of information, its representation in the form of data, its processing, providing, and use. Operations with information correspond to this process. Data/signals → Information → Knowledge → Wisdom.

(T)IS classification by the prevailing function:
1. Information retrieval systems.
2. Database management systems (DBMS), relational DB (PB154, PB155, PV003, PV055, PV136, PB114).
3. Management information systems (PV045).
4. Decision support systems (PV098).
5. Expert systems, question answering systems, knowledge-based systems (PA031).
6. Information service systems (web 2.0).
(T)IS classification by the prevailing function (cont.):
7. Specific information systems (geographical PV019, PA049, PA050; medical PV048; environmental PV044; corporate PV043; state administration PV058, PV059; librarian PV070); and also PV063 Application of Database Systems.

Related fields taught at FI: Software engineering (PA102, PA105). Similarity searching in multimedia data (PA128). Efficient use of database systems (PA152). Introduction to information retrieval (PV211).

A hierarchy of the notions: information retrieval system — expert system — DBMS — database system — management system.

Mini questionnaire:
1. What do you expect from this course? What was your motivation to enroll? Is the planned syllabus fine? Any changes or surprises?
2. What do you not expect (what would you rather eliminate)?
3. Which related courses have you already passed?
4. Practising IS usage (as a user): a) Which (T)IS do you use? b) Intensity? Frequency? How many searches per month? c) Are you satisfied with it?
5. IS creation (server): a) Which (T)IS or its components have you realized? Area, size? b) Are you satisfied with it? Bottlenecks?

Information retrieval systems (IRS) — principles (scheme): the definition of documents, the definition of the output file format and the definition of the search method are fed to the SEARCH ENGINE; the content of the documents (DB) and the queries then yield the set of selected documents. Simplified: DB + query → search engine → set of selected documents.

Searching — formalization of the problem:
Concatenation: a string of beads. A bead is an element. Indexing of elements by natural numbers — not necessarily numbers, but labels:
0) Every element has a unique label.
1) Every labeled element x (except for the leftmost one) has a unique predecessor referred to as pred(x).
2) Every labeled element x (except for the rightmost one) has a unique successor referred to as succ(x).
3) If the element x is not the leftmost one, x = succ(pred(x)).
4) If the element x is not the rightmost one, x = pred(succ(x)).
5) For every two different elements x and y, there exists a positive number k such that either x = succ^k(y) or x = pred^k(y).
Searching — formalization of the problem (cont.):
Definition: a string is a set of elements which meets the rules 0)–5).
Definition: a linear string: a string that has finitely many elements, including the leftmost and the rightmost ones.
Definition: a necklace.
Definition: an alphabet A; letters of the alphabet; A+ (the nonempty words over A); the empty string ε.
Definition: A* = A+ ∪ {ε}.
Definition: a linear string over A: a member of A+.
Definition: a pattern. A text.

Classification of search algorithms:
1. According to the passing direction: left-to-right / right-to-left.
2. According to the (pre)processing of the text and the pattern:
   • ad fontes (searching in the text itself);
   • text surrogate (searching in a substitute of the text);
   • substitutes: an index — an ordered list of significant elements together with references to the original text; a signature — a string of indicators that shows the occurrence of significant elements in the text.

Classification (cont.):

                         pattern preprocessing
  text preprocessing       no      yes
  no                       I       II
  yes                      III     IV

I — elementary algorithms; II — creating a search engine; III — indexing methods; IV — signature methods.

Classification according to the cardinality of the set of patterns:
1. Search for a single pattern v in the text T. The result: yes/no.
2. Search for a finite set of patterns P = {v1, v2, ..., vk}. The result: information about the positions of some of the given patterns.
3. Search for an infinite set of patterns given by a regular expression R. R defines a potentially infinite set L(R). The result: information about the positions of some of the patterns from L(R).

Alternatives of the formulation of the searching problem: a) the first occurrence; b) all occurrences without overlapping; c) all occurrences including overlapping.

I. SE without preprocessing of the patterns or the text — exact search. Naive search, brute-force search, the rudimentary search algorithm:

proc Brute-Force-Matcher(PATTERN, TEXT):
  T := length[TEXT]; P := length[PATTERN];
  for i := 0 to T-P do
    if PATTERN[1..P] = TEXT[i+1..i+P] then
      print "The pattern was found at the position i"

Time complexity analysis of the naive search:
• The complexity is measured by the number of comparisons; P is the length of the pattern, T the length of the text.
• The upper estimate S = P · (T − P + 1), thus O(P × T).
• The worst case: PATTERN = a^(P−1)b, TEXT = a^(T−1)b.
• Natural languages: the (average) complexity (number of comparisons) is substantially smaller, since the equality of prefixes doesn't occur very often; for English S = C_E · (T − P + 1), C_E empirically measured as 1.07, i.e. practically linear.
• C_Cz? C_Cz vs. C_E?
• Any speedups? An application to several patterns? An infinite number?
• We will see the versions (S, Q, Q') of the algorithm in the seminar.
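The cost model above can be checked experimentally. A hedged sketch (hypothetical helper name, 0-based Python indexing, not from the course materials) of the brute-force matcher instrumented to count symbol comparisons:

# Brute-force matching with a comparison counter, so that the estimate
# S = P * (T - P + 1) can be verified on the worst case from the slide.
def brute_force_count(pattern: str, text: str):
    P, T = len(pattern), len(text)
    comparisons, positions = 0, []
    for i in range(T - P + 1):
        j = 0
        while j < P:
            comparisons += 1
            if pattern[j] != text[i + j]:
                break
            j += 1
        if j == P:
            positions.append(i)      # 0-based position of an occurrence
    return positions, comparisons

# Worst case: PATTERN = a^(P-1) b, TEXT = a^(T-1) b.
print(brute_force_count("aaab", "a" * 19 + "b"))
# -> ([16], 68), i.e. exactly P * (T - P + 1) = 4 * 17 comparisons

On English-like text the same counter stays close to 1.07 · (T − P + 1), matching the empirical constant mentioned above.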
Exercise: express the time complexity of the following search algorithms using the variables c and s, where c is the number of the tests and these statements hold:
• if the index i is found, then c = i and s = 1;
• otherwise, c = T and s = 0.

Naive search — algorithm:

input: var TEXT : array [1..T] of word; PATTERN : word;
output (in the variable FOUND): yes/no

1      I := 1;
c      while I <= T do begin
c        if TEXT[I] = PATTERN then break;
c−s      inc(I);
       end;
2      FOUND := (I <> T+1);

A variant with a sentinel:

input: var TEXT : array [1..T+1] of word; PATTERN : word;

1      I := 1;
1      TEXT[T+1] := PATTERN;
c      while TEXT[I] <> PATTERN do
c−1      inc(I);
2      FOUND := (I <= T);

In this case, the index is always found; therefore the last but one line states the complexity c − 1 instead of c − s (although they are equivalent here, since s = 1 always). Furthermore, it is necessary to realize that the maximal possible value of c is greater by one than in the previous algorithm (stating c + 1 instead of c would not be correct, though). The overall complexity: O(T) = 2c + 3. The maximum complexity: O(T) = 2T + 5.

And how about using cycle expansion (loop unrolling)?

input: var TEXT : array [1..T+1] of word; PATTERN : word;
output (in the variable FOUND): yes/no

1          I := 1;
1          TEXT[T+1] := PATTERN;
⌈c/2⌉      while TEXT[I] <> PATTERN do begin
⌊c/2⌋        if TEXT[I+1] = PATTERN then break;
⌊(c−1)/2⌋    I := I + 2;
           end;
3          FOUND := (I < T) or (TEXT[T] = PATTERN);

The overall complexity: O(T) = c + ⌊(c − 1)/2⌋ + 5. The maximum complexity: O(T) = T + ⌊T/2⌋ + 6. The condition at the end of the algorithm guarantees its functionality (however, it is not the only way of handling the cycle incrementation by two).

Outline:
• Watson.
• Exact search methods I (without pattern preprocessing) — completion.
• Exact search methods II (with pattern preprocessing, left to right): KMP (animation), Rabin-Karp, AC.
• Search with an automaton.

Evaluation of the questionnaire:
1. Yes: the syllabus suits the expectations; the dissection of Google is awaited positively; indexing and search; examples.
2. No: too much theory, deep digestion of algorithms.
3. Examples.
4. This year: further enrichment of the information retrieval part (Google), textual (mathematical) digital libraries, and language enhancements of TIS (on the example of Watson).

Examples of searching:
• Search in a text editor (Vim, Emacs), in the source code of a web page.
• Data search (biological molecules approximated as sequences of nucleotides or amino acids).
• Literature/abstract search — recherche, corpus linguistics.
The size of available data doubles every 18 months (Moore's law) → higher effectiveness of algorithms is needed.
II. Exact search with query (pattern) preprocessing — left-to-right direct search methods:

During the preprocessing, the structure of the query pattern(s) is examined and, on that basis, the search engine is built (on the fly).

Definition: exact (vs. fuzzy (proximitní)) search aims at an exact match (localization of the searched pattern(s)).
Definition: left-to-right (LR, sousměrné) (vs. right-to-left (RL, protisměrné)) search compares the query pattern to the text from left to right (vs. from right to left).

Left-to-right methods:
1. A single query pattern (vzorek):
   • Shift-Or algorithm.
   • Karp-Rabin algorithm (KR, 1987).
   • Knuth-Morris-Pratt algorithm (KMP, designed (MP) in 1970, published 1977).
2. n patterns: Aho-Corasick algorithm (AC, 1975).
3. ∞ patterns: construction of a search engine (finite automaton) for the search of a potentially infinite set of patterns (given as a regular expression).

Shift-Or algorithm:
• Pattern v = v1 v2 ... vm over an alphabet Σ = {a1, ..., ac}.
• Incidence matrix X (m × c), X_ij = 0 if v_i = a_j, X_ij = 1 otherwise.
• The column of X corresponding to a_j is named A_j.
• At the beginning, we put a unitary vector into the column vector R. In every step of the algorithm, R moves down by one line/position, the top-most position is filled with zero, and one character a_j is read from the input. The resulting R is combined with A_j by binary disjunction: R := SHIFT(R) OR A_j.
• The algorithm stops successfully when 0 appears at the bottom-most position of R.
Example: v = vzorek over Σ = {e, k, o, r, v, z}. Cf. [POK, page 31–32].

Karp-Rabin search:

A quite different approach: usage of a hash function. Instead of matching the pattern against the text at every position, we check the match only when the pattern 'looks similar' to the searched text substring. For the similarity, a hash function is used. It has to be
• efficiently computable,
• and it should be good at separating different strings (close to perfect hashing).
KR search is quadratic in the worst case, but on average O(T + V).
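A minimal sketch of this rolling-hash idea (hedged: Python's unbounded integers replace machine-word arithmetic; the base-2 hash and the rehash step mirror the C implementation that follows):

# Karp-Rabin: verify a candidate position only when the window hash
# matches the pattern hash; the window hash is updated in O(1).
def karp_rabin(pattern: str, text: str):
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    d = 1 << (m - 1)                  # weight of the leading symbol, 2^(m-1)
    hp = ht = 0
    for i in range(m):                # initial hashes of pattern and window
        hp = (hp << 1) + ord(pattern[i])
        ht = (ht << 1) + ord(text[i])
    hits = []
    for i in range(n - m + 1):
        if hp == ht and text[i:i + m] == pattern:    # verify on hash match
            hits.append(i)
        if i + m < n:                 # REHASH: drop text[i], append text[i+m]
            ht = ((ht - ord(text[i]) * d) << 1) + ord(text[i + m])
    return hits

print(karp_rabin("ana", "banana"))    # [1, 3]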
Karp-Rabin search (cont.) — implementation:

#define REHASH(a, b, h) (((h-a*d)<<1)+b)

void KR(char *y, char *x, int n, int m) {
  int hy, hx, d, i;
  /* preprocessing: computation of d = 2^(m-1) */
  d = 1; for (i = 1; i < m; i++) d = (d << 1);
  /* ... (the rest of the routine computes hx, hy and slides the window
     over the text using REHASH, as sketched above) */
}

Knuth-Morris-Pratt algorithm:

i := 1;                        -- text index
j := 1;                        -- pattern index
while (i <= T) and (j <= V) do
  while (j > 0) and (text[i] <> pattern[j]) do
    j := h[j];
  end while;
  i := i + 1; j := j + 1;
end while;
found := (j > V);              -- if found, it is at the position i − V

Knuth-Morris-Pratt algorithm (cont.):
• O(T) complexity plus the complexity of the preprocessing (creation of the array h).
• Animation of the tracing of the main part of KMP.
• h is used when the prefix of the pattern v1 v2 ... v(j−1) matches the substring of the text t(i−j+1) t(i−j+2) ... t(i−1) and v_j ≠ t_i.
• May I shift by more than 1? By j? How to compute h?
• h(j) is the biggest k < j such that v1 v2 ... v(k−1) is a suffix of v1 v2 ... v(j−1), i.e. v1 v2 ... v(k−1) = v(j−k+1) v(j−k+2) ... v(j−1), and v_j ≠ v_k.
• KMP: backward transitions are taken for as long as needed, until j = 0 (the prefix of the pattern is not contained in the searched text) or t_i = v_j.
• Animation by Lecroq; also [POK, page 27]; see [MAR] for a detailed description.

Computation of h for KMP:

i := 1; j := 0; h[1] := 0;
while (i < V) do begin
  while (j > 0) and (v[i] <> v[j]) do j := h[j];
  i := i + 1; j := j + 1;
  if (i <= V) and (v[i] = v[j]) then h[i] := h[j] else h[i] := j   (*MP*)
end;

The complexity of the computation of h, i.e. of the preprocessing, is O(V); thus O(T + V) in total. Example: h for ababa. KMP vs. MP.

Search with a search engine (finite automaton) that uses a transition table g derived from the searched pattern (g relates to the transition function δ of a FA):

var i, T: integer; found: boolean;
    text: array[1..T] of char;
    state, q0: TSTATE;
    g: array[1..maxstate, 1..maxsymb] of TSTATE;
    F: set of TSTATE; ...
begin
  found := FALSE; state := q0; i := 0;
  while (i <= T) and not found do begin
    i := i + 1;
    state := g[state, text[i]];
    found := state in F;
  end;
end;

How to transform the pattern into g?
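The two Pascal fragments above (the computation of h and the search loop) combine into the following self-contained sketch. It uses the common 0-based border-array formulation rather than the slides' 1-based convention, so for ababa it prints the border lengths [-1, 0, 0, 1, 2, 3], which correspond to the lecture's 1-based h values (0, 1, 1, 2, 3):

def mp_failure(pattern):
    # h[j] = length of the longest proper border of pattern[:j]
    h = [0] * (len(pattern) + 1)
    h[0] = -1
    k = -1
    for j in range(1, len(pattern) + 1):
        while k >= 0 and pattern[k] != pattern[j - 1]:
            k = h[k]
        k += 1
        h[j] = k
    return h

def kmp_search(pattern, text):
    h = mp_failure(pattern)
    hits, j = [], 0
    for i, c in enumerate(text):
        while j >= 0 and pattern[j] != c:   # backward transitions
            j = h[j]
        j += 1
        if j == len(pattern):               # occurrence ends at position i
            hits.append(i - j + 1)
            j = h[j]
    return hits

print(mp_failure("ababa"))                  # [-1, 0, 0, 1, 2, 3]
print(kmp_search("ababa", "abababa"))       # [0, 2]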
Search engine (SE) for left-to-right search: A = (Q, T, g, h, q0, F), where
• Q is a finite set of states;
• T is a finite input alphabet;
• g: Q × T → Q ∪ {fail} is a forward state-transition function;
• h: (Q − {q0}) → Q is a backward state-transition function;
• q0 is an initial state;
• F is a set of final states.

• The depth of a state q: d(q) ∈ N0 is the length of the shortest sequence of forward state transitions from q0 to q.
• Characteristics of g, h:
  • g(q0, a) ≠ fail for ∀a ∈ T (there is no backward transition in the initial state).
  • If h(q) = p, then d(p) < d(q) (the number of the backward transitions is thus bounded from above by the number of the forward transitions plus the maximum depth of a state). So the speed of the searching is linear in the length of the text.

SE configuration, transitions:
• An SE configuration is (q, w), q ∈ Q, w ∈ T* the not yet searched part of the text.
• The initial configuration of the SE is (q0, w), where w is the entire searched text.
• An accepting configuration of the SE is (q, w), q ∈ F; w is the not yet searched text, the found pattern is immediately before w.
• SE transition: a relation ⊢ ⊆ (Q × T*) × (Q × T*):
  • if g(q, a) = p, then (q, aw) ⊢ (p, w) — a forward transition, for ∀w ∈ T*;
  • if h(q) = p, then (q, w) ⊢ (p, w) — a backward transition, for ∀w ∈ T*.
During a forward transition, a single input symbol is read and the engine switches to the next state p. However, if g(q, a) = fail, the backward transition is executed without reading an input symbol. S = O(T) (we measure the number of SE transitions).

Construction of the KMP engine:
1. An initial state q0.
2. g(q, v(j+1)) = q', where q' corresponds to the prefix v1 v2 ... v(j+1).
3. For q0, we define g(q0, a) = q0 for ∀a for which g(q0, a) has not been defined in the previous step.
4. g(q, a) = fail for ∀q and a for which g(q, a) has not been defined in the previous steps.
5. The state that corresponds to the complete pattern is the final one.
6. The backward state-transition function h is defined by the algorithm for the computation of h above.

Outline:
• Summary of the previous lecture, searching with an SE.
• Algorithms for left-to-right search of n patterns. (AC, NFA → DFA.)
• Algorithms for left-to-right search of infinite sets of patterns.
• Regular expressions (RE).
• Direct construction of an (N)FA for a given RE.

Search of a finite set of patterns:
• An SE for left-to-right search of a set of patterns P = {v1, v2, ..., vP}.
• Instead of a repeated search of the text for every pattern, there is only "one" pass (FA).

AC search algorithm:

var text: array[1..T] of char; i: integer; found: boolean;
    state: tstate;
    g: array[1..maxstate, 1..maxsymbol] of tstate;
    h: array[1..maxstate] of tstate;
    F: set of tstate;
found := false; state := q0; i := 0;
while (i <= T) and not found do begin
  i := i + 1;
  while g[state, text[i]] = fail do state := h[state];
  state := g[state, text[i]];
  found := state in F
end

• Construction of the state-transition functions g, h?
• How about for P patterns? The main idea?
• Aho, Corasick, 1975 (AC search engine).

Construction of g for the AC SE for a set of patterns P = {v1, v2, ..., vP}:
1. An initial state q0.
2. g(q, b(j+1)) = q', where q' corresponds to the prefix b1 b2 ... b(j+1) of the pattern v^i, for ∀i ∈ {1, ..., P}.
3. For q0, we define g(q0, a) = q0 for ∀a for which g(q0, a) has not been defined in the previous steps.
4. g(q, a) = fail for ∀q and a for which g(q, a) has not been defined in the previous steps.
5. Each state that corresponds to a complete pattern is a final one.

An example: P = {he, she, her} over T = {h, e, r, s, x}, where x stands for anything else than {h, e, r, s}.
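A compact sketch of the whole AC engine (a hedged illustration with hypothetical helper names: the goto function g is built as the prefix trie just described, and the failure function is computed breadth-first in the spirit of the construction formalized on the following slides):

from collections import deque

def build_ac(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                    # phase 1: build the trie (function g)
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({}); fail.append(0); out.append(set())
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].add(p)                     # final state for the pattern p
    queue = deque(goto[0].values())       # phase 2: failure links, by depth
    while queue:
        q = queue.popleft()
        for a, r in goto[q].items():
            queue.append(r)
            s = fail[q]
            while s and a not in goto[s]:
                s = fail[s]
            fail[r] = goto[s].get(a, 0)
            out[r] |= out[fail[r]]        # inherit the matches of the suffix state
    return goto, fail, out

def ac_search(text, goto, fail, out):
    q, hits = 0, []
    for i, a in enumerate(text):
        while q and a not in goto[q]:     # backward transitions
            q = fail[q]
        q = goto[q].get(a, 0)
        for p in out[q]:
            hits.append((i - len(p) + 1, p))
    return hits

g, f, o = build_ac(["he", "she", "her"])
print(sorted(ac_search("ushers", g, f, o)))   # [(1, 'she'), (2, 'he'), (2, 'her')]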
Construction of h for the AC SE for a set of patterns P = {v1, v2, ..., vP}:

At first, we define the failure function f inductively, relative to the depth of the states, this way:
1. For ∀q of the depth 1, f(q) = q0.
2. Let us assume that f is defined for each state of the depth d and lesser. Let q be a state of the depth d with g(q, a) = q'. Then we compute f(q') as follows:
     r := f(q);
     while g(r, a) = fail do r := f(r);
     f(q') := g(r, a);

• The cycle terminates, since g(q0, a) ≠ fail for any a.
• If the states q, r represent the prefixes u, v of some of the patterns from P, then f(q) = r ⟺ v is the longest proper suffix of u.

Construction of h for the AC SE (cont.):

We could use f as the backward state-transition function h; however, redundant backward transitions would then be performed. We define the function h inductively, relative to the depth of the states, this way:
• For ∀ state q of the depth 1, h(q) = q0.
• Let us assume that h is defined for each state of the depth d and lesser. Let the depth of q be d + 1. If the set of letters for which the value of g in the state f(q) differs from fail is a subset of the set of letters for which the value of g in the state q differs from fail, then h(q) := h(f(q)), otherwise h(q) := f(q).

Finite automata for searching:

Deterministic finite automaton (DFA) M = (K, T, δ, q0, F):
1. K is a finite set of inner states.
2. T is a finite input alphabet.
3. δ is a mapping from K × T to K.
4. q0 ∈ K is an initial state.
5. F ⊆ K is a set of final states.

• A completely specified automaton: δ is defined for every pair (q, a) ∈ K × T; otherwise an incompletely specified automaton.
• A configuration of M is a pair (q, w), where q ∈ K and w ∈ T* is the not yet searched part of the text.
• The initial configuration of M is (q0, w), where w is the entire text to be searched.
• An accepting configuration of M is (q, w), where q ∈ F and w ∈ T*.
• During a transition, a single input symbol is read and the automaton switches to the next state p. A transition of M is defined by a state and an input symbol; it is the relation ⊢ ⊆ (K × T*) × (K × T*): if δ(q, a) = p, then (q, aw) ⊢ (p, w).

Nondeterministic finite automaton (NFA): δ(q, a) is now a set of states (a mapping to 2^K).
Definition: the transition ⊢ ⊆ (K × T*) × (K × T*): if p ∈ δ(q, a), then (q, aw) ⊢ (p, w) for ∀w ∈ T*.
Definition: a final state, L(M) — analogically as in the DFA.

Construction of an SE (DFA) from an NFA:
Theorem: for every nondeterministic finite automaton M = (K, T, δ, q0, F), we can build a deterministic finite automaton M' = (K', T, δ', q'0, F') such that L(M) = L(M').
A constructive proof (the subset-construction algorithm):
Input: a nondeterministic FA M = (K, T, δ, q0, F). Output: a deterministic FA M' with L(M) = L(M').
1. K' = {{q0}}; the state {q0} is unmarked.
2. If all the states in K' are marked, continue with step 4.
3. Choose an unmarked state q' from K':
   • δ'(q', a) = ∪ δ(p, a) for ∀p ∈ q' and ∀a ∈ T;
   • K' = K' ∪ {δ'(q', a) : a ∈ T}, the newly added states unmarked; mark q' and go to step 2.
4. q'0 = {q0}; F' = {q' ∈ K' : q' ∩ F ≠ ∅}.
Example: an NFA and its corresponding DFA M'.

Regular expression (RE):

Definition: a regular expression E over the alphabet A:
1. ε, ∅ are RE, and a is a RE for ∀a ∈ A.
2. If x, y are RE over A, then:
   • (x + y) is a RE (union);
   • (x.y) is a RE (concatenation);
   • (x)* is a RE (iteration).

A convention about the priority of the regular operations: union < concatenation < iteration.
Definition: thereafter, we consider as a (generalized) regular expression even those terms that do not contain, with regard to this convention, the unnecessary parentheses.

Value of a RE (the mapping h):
• h(∅) = ∅, h(ε) = {ε}, h(a) = {a};
• h(x + y) = h(x) ∪ h(y);
• h(x.y) = h(x).h(y);
• h(x*) = (h(x))*, since h(x*) = ε ∪ x ∪ x.x ∪ x.x.x ∪ ...
• The value of a RE is a regular language (RL).
• Every RL can be represented by a RE.
• For ∀ RE V ∃ FA M: h(V) = L(M).

Axiomatization of RE (Salomaa 1966):

A1   x + (y + z) = (x + y) + z     associativity of union
A2   x.(y.z) = (x.y).z             associativity of concatenation
A3   x + y = y + x                 commutativity of union
A4   (x + y).z = x.z + y.z         right distributivity
A5   x.(y + z) = x.y + x.z         left distributivity
A6   x + x = x                     idempotence of union
A7   ε.x = x                       identity element for concatenation
A8   ∅.x = ∅                       zero element for concatenation
A9   x + ∅ = x                     identity element for union
A10  x* = ε + x*x
A11  x* = (ε + x)*

Outline (Week four):
• Summary of the previous lecture.
• Regular expressions, the value of a RE, characteristics.
• Derivation of regular expressions.
• Direct construction of an equivalent DFA for a given RE by derivation.
• Derivation of regular expressions by the position vector.
• Right-to-left search (BMH, CW, BUC).

Characteristics of regular expressions:
Theorem: the axiomatization of RE is complete and consistent.
Definition: regular expressions are termed similar if they can be mutually converted using the axioms A1 to A11.
Theorem: similar regular expressions have the same value.
Definition: the length d(E) of the regular expression E:
1. If E consists of one symbol, then d(E) = 1.
2. d(V1 + V2) = d(V1) + d(V2) + 1.
3. d(V1.V2) = d(V1) + d(V2) + 1.
4. d(V*) = d(V) + 1.
5. d((V)) = d(V) + 2.
Note: the length corresponds to the syntax of the regular expression.
Construction of an NFA for a given RE:

Definition: a generalized NFA allows ε-transitions (transitions without reading an input symbol).
Theorem: for every RE E, we can create a FA M such that h(E) = L(M).
Proof: by structural induction relative to the structure of E:
• E = ε, E = ∅, E = a: elementary automata with two states and at most one transition.
• E = E1*: the automaton M1 for E1 (h(E1) = L(M1)) extended with ε-transitions realizing the iteration.
• E = E1.E2: the automata M1, M2 for E1, E2 connected serially.
• E = E1 + E2: the automata M1, M2 for E1, E2 (h(E1) = L(M1), h(E2) = L(M2)) connected in parallel.

Characteristics of the constructed NFA:
• No more than two edges come out of every state.
• No edges come out of the final states.
• The number of the states of M ≤ 2 · d(E).
• The simulation of the automaton M is performed in O(d(E) · T) time and in O(d(E)) space.

NFA simulation (cont.):
For the following methods of NFA simulation, we must remove the ε-transitions. We can achieve that with the well-known procedure. We represent a set of states with a Boolean vector and we pass through all the paths at the same time. There are two approaches:
• a general algorithm that uses a transition table;
• an implementation of the automaton in the form of a (generated) program for the particular automaton.

Direct construction of an (N)FA for a given RE:

Let E be a RE over the alphabet T. Then we create a FA M = (K, T, δ, q0, F) such that h(E) = L(M) this way:
1. We assign different natural numbers to all the occurrences of the symbols of T in the expression E. We get E'.
2. A set of starting symbols Z = {x_i : a string of h(E') can start with the symbol x_i, x_i ≠ ε}.
3. A set of neighbours P = {x_i y_j : the symbols x_i ≠ ε ≠ y_j can be next to each other in a string of h(E')}.
4. A set of ending symbols F = {x_i : a string of h(E') can end with the symbol x_i ≠ ε}.
5. A set of states K = {q0} ∪ Z ∪ {y_j : x_i y_j ∈ P}.
6. A transition function δ:
   • δ(q0, x) contains x_i for ∀x_i ∈ Z that originate from the numbering of x;
   • δ(x_i, y) contains y_j for ∀x_i y_j ∈ P where y_j originates from the numbering of y.

Construction of the Commentz-Walter (CW) search engine:
INPUT: P = {v1, ..., vk}. OUTPUT: the CW search engine.
METHOD: we construct the function g and introduce the evaluation of the individual states w:
1. An initial state q0; w(q0) = ε.
2. Each state of the search engine corresponds to a suffix b_m b_(m+1) ... b_n of a pattern v_i of the set P. Let us define g(q, a) = q', where q' corresponds to the suffix a b_m b_(m+1) ... b_n of a pattern v_i; w(q) = b_n ... b_(m+1) b_m; w(q') = w(q)a.
3. g(q, a) = fail for every q and a for which g(q, a) was not defined in step 2.
4. Each state that corresponds to a full pattern is a final one.
Right-to-left search of one pattern — example: P = {cacbaa, aba, acb, acbab, ccbab}, LMIN = 3.

Character shifts: shift(a) = 1, shift(b) = 1, shift(c) = 2, shift(x) = 4 (x = any other character).

  w         shift1   shift2
  ε         1        3
  a         1        2
  b         1        3
  aa        3        2
  ab        1        2
  bc        2        3
  ba        1        1
  aab       3        2
  aba       3        2
  bca       2        2
  bab       3        1
  aabc      3        2
  babc      3        1
  aabca     3        2
  babca     3        1
  babcc     3        1
  aabcac    3        2

(The strings w are the reversed prefixes of the patterns of P.)

Outline:
• Right-to-left search of an infinite set of patterns.
• The two-way jump automaton — a generalization of the left-to-right and right-to-left algorithms learned so far.
• The hierarchy of the exact search engines.

Right-to-left search for an infinite set of patterns:

Definition: a reversed regular expression is created by the reversal of all concatenations in the expression.
Example: the reversed RE for E = bc(a + a*bc) is E^R = (a + cba*)cb.
Buczilowski: to search for E, we create E^R and use it for the determination of shift[STATE, SYMBOL] for each state and undefined transition, analogically as in the CW algorithm:

  state   a   b        c   x
  0       1   3        •   1
  1       1   2(3!)    2   1
  3       1   1        1   4
  4       1   1        1   1
  5       1   1        1

Two-way jump automaton (2DFAS):

Definition: a 2DFAS is M = (Q, Σ, δ, q0, k, ⊥, F), where
• Q is a set of states,
• Σ is an input alphabet,
• δ is a mapping Q × Σ → Q × {−1, 1, ..., k},
• q0 ∈ Q is an initial state,
• k ∈ N is the maximal length of a jump,
• ⊥ ∉ Q ∪ Σ is a jump symbol (mark),
• F ⊆ Q is a set of final states.

Definition: a configuration of the 2DFAS is a string from Σ* Q Σ* ⊥ Σ*.
Definition: we denote the set of configurations of a 2DFAS M as K(M).
Example: a1 ... a(i−1) q a_i ... a(j−1) ⊥ a_j ... a_n ∈ K(M): the state q marks the position of the reading head (čtecí hlava), ⊥ is the jump mark (značka skoku).

Definition: a transition of the 2DFAS is a relation ⊢ ⊆ K(M) × K(M) with four kinds of steps:
• right-to-left comparison: a1 ... a(i−1) a_i q w ⊥ v ⊢ a1 ... a(i−1) q' a_i w ⊥ v, for i ≥ 1 and δ(q, a_i) = (q', −1) (the symbol a_i is compared and the head moves one position to the left);
• right-to-left jump: a1 ... a_i q w ⊥ a_j ... a_n ⊢ a1 ... a(t−1) q' ⊥ a_t ... a_n, for δ(q, a_i) = (q', m), m ≥ 1, t = min{j + m, n + 1} (the jump mark moves m positions to the right and the head jumps to it);
• left-to-right jump: a1 ... a_j q a(j+1) ... ⊥ a_i ... a_n ⊢ a1 ... a(t−1) q' ⊥ a_t ... a_n, for δ(q, a_j) = (q', m), m ≥ 1, t = min{i + m, n + 1};
• left-to-right comparison: u ⊥ ... q a_i w ⊢ u ⊥ ... a_i q' w, for δ(q, a_i) = (q', 1) (the symbol a_i is compared and the head moves one position to the right).
(The left-to-right rules are used by the left-to-right engines and vice versa.)
Definition: ⊢^k, ⊢* analogically as for the SE.

Definition: the language accepted by the two-way automaton M = (Q, Σ, δ, q0, k, ⊥, F) is the set L(M) = {w ∈ Σ* : q0 ⊥ w ⊢* u q_f x v ⊥ ..., where q_f ∈ F, u, v ∈ Σ*, x ∈ Σ}.
Theorem: L(M) for a 2DFAS M is regular.
Example: formulate the right-to-left search of the pattern BANANA in the text I-WANT-TO-FLAVOUR-NATURAL-BANANAS using BM as a 2DFAS and trace the search as a sequence of configurations of the 2DFAS.
Exercise: let us have the regular expression R = 1(0 + 1*02) over the alphabet A = {0, 1, 2}.
• Construct the right-to-left DFA for R (Buczilowski) and compute the failure function. Draw the transition graph of this automaton including the visualization of the failure function.
• Express the resulting automaton as a 2DFAS and trace the searching in the text 11201012102.

Hierarchy of the exact search engines:

  # of patterns   left-to-right   right-to-left
  1               KMP             BM
  k               AC              CW
  ∞               DFA             BUC

All of these are generalized by the 2DFAS; the axes of the hierarchy are the direction and the number of patterns.

Outline (Week five):
• Fuzzy (proximity) search. Metrics for the measurement of the distance of strings.
• Classification of search: the 6D space of search problems.
• Examples of the creation of search engines.
• Completion of the chapter about searching without text preprocessing.
• Indexing basics.

Metrics (for proximity search):

How to measure the similarity of strings?
Definition: we call d : S × S → R a metric if the following holds:
1. d(x, y) ≥ 0;
2. d(x, x) = 0;
3. d(x, y) = d(y, x) (symmetry);
4. d(x, y) = 0 ⇒ x = y (identity of indiscernibles);
5. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
We call the values of the function d distances.

Definition: let us have strings X and Y over the alphabet Σ. The minimal number of editing operations needed for the transformation of X to Y is called
• the Hamming distance (R-distance) when we allow just the operation Replace;
• the Levenshtein distance (DIR-distance) when we allow the operations Delete, Insert and Replace;
• the generalized Levenshtein distance (DIRT-distance) when we allow the operations Delete, Insert, Replace and Transpose; a transposition is possible on the neighbouring characters only.
They are metrics; the Hamming distance must be computed over strings of the same length, the Levenshtein distance can be used for strings of different lengths.

Classification of search problems:

Definition: let the text T = t1 t2 ... tn and the pattern P = p1 p2 ... pm. For example, we can ask:
1. Is P a substring of T?
2. Is P a subsequence of T?
3. Is a substring or a subsequence P in T?
4. Is P in T such that D(P, X) ≤ k for k < m, where X = t_i ... t_j is a part of T (D is the R, DIR or DIRT distance)?
5. Is a string P containing the don't care symbol * in T?
6. Is a sequence of patterns P in T?
Furthermore, there are the variants for more patterns, plus the instances of the search problem: yes/no, the first occurrence, all the overlapping occurrences, all the non-overlapping occurrences.
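For problems of type 4, the three distances have straightforward dynamic-programming implementations. A minimal sketch (hypothetical helper names; the DIRT variant uses the usual optimal-string-alignment recurrence restricted to adjacent transpositions):

def hamming(x, y):
    assert len(x) == len(y)              # the R-distance needs equal lengths
    return sum(a != b for a, b in zip(x, y))

def levenshtein(x, y, transpositions=False):
    m, n = len(x), len(y)
    # d[i][j] = distance between x[:i] and y[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # Delete
                          d[i][j - 1] + 1,                           # Insert
                          d[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # Replace
            if (transpositions and i > 1 and j > 1
                    and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)          # Transpose
    return d[m][n]

print(hamming("abcd", "acbd"))             # 2 (R-distance)
print(levenshtein("abcd", "acbd"))         # 2 (DIR-distance)
print(levenshtein("abcd", "acbd", True))   # 1 (DIRT: one transposition)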
Examples:
• Find such an example of strings X and Y that simultaneously R(X, Y) = 5, DIR(X, Y) = 5, and DIRT(X, Y) = 5, or prove the non-existence of such strings.
• Find such an example of strings X and Y that simultaneously R(X, Y) = 5, DIR(X, Y) = 1, and DIRT(X, Y) = 3, or prove the non-existence of such strings.
• Find such an example of strings X and Y of the length 2n, n ∈ N, that R(X, Y) = 2n and a) DIR(X, Y) = 2; b) DIRT(X, Y) = n.

Classification of search problems (cont.):

  Dimension   1              2           3                4                5                 6
  Values      S (string)     F (full)    O (one)          E (exact)        C (care)          O (one)
              Q (sequence)   S (sub)     F (finite)       R (R-dist.)      D (don't care)    S (sequence)
                                         I (infinite)     D (DIR-dist.)
                                                          G (DIRT-dist.)

In total 2 × 2 × 3 × 4 × 2 × 2 = 192 search problems classified in a six-dimensional space. For example, SFO??? denotes all the SE for the search of one (entire) string. For all these problems, we are going to learn how to create an NFA for searching.

Example: let P = p1 p2 p3 p4, m = 4; A denotes any character of Σ. The NFA for SFOECO: a self-loop labeled A on the initial state and a chain of forward transitions p1, p2, p3, p4 to the final state.

Example: the NFA for QFOECO (seQuence): a chain of forward transitions p1, ..., pm, where every inner state carries a self-loop labeled with any character of Σ except the following pattern symbol. The automaton has m + 1 states for a pattern of the length m.

Search of a substring: the NFA for SSOECO.
Definition: this automaton is called the initial ε-trellis and has (m + 1) + m + (m − 1) + ... + 2 = m(m + 3)/2 states.

Example: the NFA for QSOECO is similar; we just add the self-loop cycles for the non-matching characters and ε-transitions parallel to all the existing forward transitions (or we concatenate the automaton m times).
Definition: the automaton for QSOECO is called the ε-trellis.

Proximity search — SFORCO:
Example: SFORCO for the Hamming (R) distance, k ≤ 3: the number of the level corresponds to the distance.
Definition: this automaton is called the R-trellis and has (m + 1) + m + (m − 1) + ... + (m − k + 1) = (k + 1)(m + 1 − k/2) states. The number of the level of the final state corresponds to the distance of the found string from the pattern.

Proximity search — SFODCO for the DIR-distance:
Example: SFODCO for the Levenshtein (DIR) distance, k ≤ 3: in addition to the R-trellis there are "D-edges" and "I-edges" (for the operations Delete and Insert); the pattern is p1 p2 p3 p4.

For the DIRT-distance, we add to the SFODCO automaton further new states that correspond to the operation of transposition, and also the corresponding pair of edges for every transposition.

Notes:
• The animation program by Mr. Pojer for the discussed search automata is available for download from the course web page and is also installed in B311.
• Simulation of the NFA or determinisation? A hybrid approach.
• The Prague Stringology Club and its conference series: see http://www.stringology.org/.
Outline (Week six):
• Searching with text preprocessing; indexing methods.
• Methods of indexing.
• Automatic indexing, thesaurus construction.
• Ways of index implementation.

Large amounts of texts? Then text preprocessing pays off!
• Index, indexing methods, index file, index-sequential file.
• Hierarchical structuring of text, tagging of text, hypertext.
• Questions of word list (lexicon) storing and occurrence (hit) list storing, and of their updating.

Indexing methods:
• Granularity of the index items: document — paragraph — sentence — word:

           doc1   doc2   doc3
  word1    1      0      1
  word2    1      1      0
  word3    0      1      1
  word4    1      1      1

• Inverted file: the transposition of this matrix.
• Word order (primary key) in the index → binary search. Time complexity of searching one word in the index, with n the index length and V the pattern length: O(V × log2(n)).
• Searching for k words: the pattern p = v1, ..., vk.

Corpus query examples (CQP):
• „Pravda"
• a:[word=„Žena|Muž|Člověk"] []* [lemma=a.lemma]

Basic queries:
• „Havel";
  45: Český prezident Václav (Havel) se včera na
  09: jak řekl Václav (Havel), každý občan
  248: více než rokem řekl (Havel) Pravda vítězí

Regular expressions:
• „Pravda | pravda"; „(P|p)ravda"; „(P|p)ravd[a,u,o,y]"; „pravd.*"; „pravd.+"; „post?el";

Word sequences:
• „prezident(a|u)" „Havl(a|ovi)";
• „a tak";
• „prezident" []* „Havel";
• „prezident" („republiky" „Václav")? „Havel";

Index system implementation — searching:

The large computational power of today's computers enables:
• efficient storing of large amounts of text data (compression, indexing);
• efficient search for text strings.
A man sitting behind a computer uses all this to obtain, from the documents processed this way, the information he is interested in. Really?
Example: a text database stores the last few years of a daily newspaper. I'd like to obtain information about president Václav Havel: a) >HAVEL; b) >more precise queries.
Computer (computational power) + human (intelligence) + time + money → valuable information. The goal of everybody is to transfer the largest possible part of the intelligence (time, money) to the computer.

Obtaining information from natural language — the ideal of ideals:

  level of analysis        deals with                     tools      state of the art
  pragmatic analysis       context                        —          no
  semantic analysis        sentence meaning               TIL        starting up
  syntactic analysis       grammar                        CFG, DCG   partially
  morphological analysis   word-forming, lemma            —          yes
  string searching         words as strings of letters    —          yes

(The successive levels correspond to the applications: information searching — correct translation — spell checking — simple translation.)

Do we really know what information is contained in a text in natural language?
• "František Novák weighs 100 kg." → object – property – value; in an RDB: attribute – key – value.
• "František Novák likes beer." "František Novák likes Jana Novotná."
• "F. N. is an old honest man." "Spring erupted in full force."
Words of the natural language denote objects, their properties and relations between them. It is possible to see the words and sentences also as "functions" of a kind, defined by their meaning.
• "A man who climbed the highest Czech eight-thousander is my grandson."

Corpora:
• Corpus: an electronic collection of texts, often indexed by linguistic tags.
• Corpus as a text information system: corpus linguistics.
• BNC, Penn Treebank, DESAM, PNK, ...; sizes range from millions to billions of positions (words); special methods are necessary.
• Corpus managers: CQP, GCQP, Manatee/Bonito, http://www.fi.muni.cz/~pary/ — see [MAR].

Definition: a corpus is a large, internally structured compact file of texts in natural language, electronically stored and processable.
• Indian languages have no script — for finding a grammar it is necessary to write up the spoken word.
• 1967: the first corpus in the U.S.A. (Kučera, Francis), 1,000,000 words.
• Noam Chomsky — refuses corpora.
• Today — massive expansion.

Corpus tools:
• The WWW page of Pavel Rychlý (~pary) links to the basic information. Bonito, Manatee.
• IMS CORPUS WORKBENCH — a toolkit for efficient representation and querying over large text files.

Internal architecture of a corpus:

A corpus is a sequence of words at numbered positions (first word, nth word), to which tags are added (the addition of tags is called corpus tagging). Tags are morphological, grammatical and any other information about a given word. This leads to the more general concept of position attributes; those are the most important type of tagging. Attributes of this class have a value (string) at every position of the corpus. The basic positional attribute is word, with one word linked to every position. In addition to this attribute, further positional attributes may be bundled with each position of any text, representing the morphological and other tags, e.g.:

  position:   0        1           2         3       4
  WORD:       Českého  prezidenta  Václava   Havla   dnes
  LEMMA:      český    prezident   Václav    Havel   dnes
  (and further layers POS1, POS2 with the morphological tags)

Structural attributes: sentences, paragraphs, title, article, SGML.

Two key terms of the internal representation of position attributes are:
• Uniform representation: the items of all attributes are encoded as integer numbers, where the same values get the same numeric code. A sequence of items is then represented as a sequence of integers. The internal representation of the attribute word (as well as of any other positional attribute) is array(0..p−1) of Integer, where p is the position count of the corpus.
• Inverted file: for the sequence of numbers representing the sequence of values of a given attribute, an inverted file is created. This file contains the set of occurrences in the position attribute for every value (more precisely, for every value code). The inverted file is needed for searching, because it directly yields the set of occurrences of a given item; the occurrences then can be counted in one step.

Both the file with the encoded attribute values and the inverted file have auxiliary files:
• The first data structure is the list of items, or "lexicon": it contains the set of different values of the attribute. Internally it is a set of strings occurring in the sequence of items, where the symbol Null (octal 000) is inserted behind every word. The list of items already defines a code for every item, because we suppose the first item in the list to have the code 0, the following 1, etc.
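A toy sketch of the uniform representation and the inverted file just described (illustrative only, not the CWB implementation; hypothetical helper name): the lexicon assigns the codes 0, 1, 2, ... in the order of first occurrence, and the lengths of the occurrence lists directly give the frequency table discussed below:

def encode_corpus(tokens):
    lexicon, codes = {}, []
    for t in tokens:                   # the first occurrence fixes the code
        code = lexicon.setdefault(t, len(lexicon))
        codes.append(code)
    inverted = [[] for _ in lexicon]   # occurrence list per item code
    for pos, code in enumerate(codes):
        inverted[code].append(pos)
    return lexicon, codes, inverted

lex, codes, inv = encode_corpus("to be or not to be".split())
print(lex)     # {'to': 0, 'be': 1, 'or': 2, 'not': 3}
print(codes)   # [0, 1, 2, 3, 0, 1]   -- the uniform representation
print(inv)     # [[0, 4], [1, 5], [2], [3]] -- frequencies = list lengths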
Internal architecture of a corpus (cont.):

There are three data structures for the inverted file:
• The first is the independent inverted file itself, which contains the sets of corpus positions.
• The second is an index of this file. For every item code, this index returns the input point of the corresponding occurrence set in the inverted file.
• The third is a table of the frequencies of the item codes, which for each item code gives the number of its occurrences in the corpus (that is, of course, the same as the size of its occurrence set).

Searching with preprocessing:

Preprocessing of the text and the pattern (query) is used by the overwhelming majority of today's TIS. Types of preprocessing:
• n-gram statistics (fragment indexes);
• special algorithms for index processing (coding, compression) and relevance evaluation (PageRank, Google);
• usage of natural language processing methods (morphology, syntactic analysis, semantic databases) and aggregation of information from multiple sources (systems AnswerBus, START);
• signature methods.

Quality measures over the selected items:

  Accuracy = (tp + tn)/(tp + fp + fn + tn)
  Recall (Sensitivity) = tp/(tp + fn)
  Precision (Positive Predictive Value) = tp/(tp + fp)
  False Positive Rate = fp/(fp + tn)
  False Negative Rate = fn/(tp + fn)
  Specificity = tn/(tn + fp)
  Negative Predictive Value = tn/(tn + fn)

Definition: relevance (of the answers to a query) is the rate by which a selected document coincides with the requirements imposed on it. Ideal answer = real answer.
Definition: the coefficient of completeness (recall) is R = m/n, where m is the number of the selected relevant records and n is the number of all the relevant records in the TIS.
Definition: the coefficient of precision is P = m/o, where o is the number of all the records selected by the query.
We want to achieve maximal R and P — a tradeoff. Standard values: 80 % for P, 20 % for R.
Combination of completeness and precision: the coefficient F_β = (1 + β²)PR/(β²P + R). (F_0 = P; F_∞ = R; with F_1, P and R are weighted equally.)

Fragment index:
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: a fixed dictionary, no problems with updates.
• Disadvantages: dependency on the language and the thematic area, decreased precision of search.

Google as an example of a search and indexing engine:
• Jeff Dean's video — historical notes on the Google search developments.
• Google — system architecture. Google — PageRank. Google File System.
• Implementation of index systems.

An example of the anatomy of a global (hyper)text information system (www.google.com):
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are known — therefore a more detailed analysis according to the article [GOO], http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.
• Several innovative concepts: PageRank, storing of a local compressed archive, calculation of relevance from the texts of hypertext links, PDF indexing and other formats, Google File System, Google Link, ...
• The system anatomy: see [MAR].

The crucial thing is the computation of the documents' relevance (credit):
• usage of the tags of the text and the web typography for the relevance calculation of document terms;
• usage of the text of the hyperlinks referring to the document.
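The other pillar, PageRank, is defined on the next slide; as a preview, a minimal power-iteration sketch (a hedged illustration, not Google's implementation: a hypothetical three-page link graph, the damping factor d = 0.85, and the 1/m normalization of the lecture's formula):

def pagerank(links, d=0.85, iterations=50):
    m = len(links)
    pr = {p: 1.0 / m for p in links}           # start from the uniform distribution
    for _ in range(iterations):                # simple power iteration
        nxt = {p: (1 - d) / m for p in links}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += d * pr[p] / len(outs)    # C(p) = number of outgoing links
        pr = nxt
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical tiny web
print(pagerank(links))   # a probability distribution; the ranks sum to 1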
Fragment index
• The fragment ybd occurs in English only in the word molybdenum.
• Advantages: fixed dictionary, no problems with updates.
• Disadvantages: dependence on the language and thematic area; decreased precision of search.

Google as an example of a web-scale information system
• Jeff Dean's video: historical notes on the development of Google search.
• Google: system architecture. Google: PageRank. Google File System.
• Implementation of index systems: an example of the anatomy of a global (hyper)text information system (www.google.com).
• 1997: google.stanford.edu, students Page and Brin.
• 1998: one of the few quality search engines whose basic fundamentals and architecture (or at least their principles) are published; hence a more detailed analysis, following the article [GOO], http://www7.conf.au/programme/fullpapers/1921com1921.htm.
• 2012: the clear leader in global web search.
• Several innovative concepts: PageRank, storage of a local compressed archive, computation of relevance from the texts of hypertext links, indexing of PDF and other formats, Google File System, ...
• The system anatomy: see [MAR].

The crucial point is the computation of the documents' relevance (credit):
• use of text tags and web typography for computing the relevance of document terms;
• use of the text of hyperlinks referring to the document.

PageRank
• PageRank: an objective measure of page importance based on citation analysis, suitable for ordering the answers to queries (computation of page relevance).
• Let pages T_1, ..., T_n (the citations) point to a page A, let m be the total number of pages, and let C(T) be the number of links going out of page T. With a damping factor d (usually d ≈ 0.85):

PR(A) = (1 - d)/m + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))

• PageRank can be computed by a simple iterative algorithm (for tens of millions of pages in hours on an ordinary PC).
• PageRank is a probability distribution over the web pages.
• PageRank is not the only factor applied; it is one coefficient among several.
• Motivation: a random surfer who follows links and, with probability 1 - d, jumps to a random page.
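The iterative computation can be sketched in a few lines of Python. The three-page link graph and the function name are invented for illustration; this is not Google's implementation:

def pagerank(links, d=0.85, iters=50):
    m = len(links)                        # total number of pages
    pr = {page: 1.0 / m for page in links}
    for _ in range(iters):                # repeated application of the formula
        pr = {page: (1 - d) / m
              + d * sum(pr[t] / len(links[t]) for t in links if page in links[t])
              for page in links}
    return pr

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}   # page -> set of outgoing links
print(pagerank(links))   # converges to a probability distribution over the pages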
Implementation of index systems (cont.)
• Storage of file signatures.
• Storage of the lexicon.
• Storage of the hit list.
• Google File System.
• Inverted file: an indexing file with a bit vector.
• Use of a document list for every key word.
• Coordinate system with pointers [MEL, fig. 4.18, page 46].
• Indexing of corpus texts: Finlib, http://www.fi.muni.cz/~pary/dis.pdf, see [MAR].
• Use of Elias coding for the compression of the hit list.
• Efficient storage of the index/dictionary (lemmas): packed trie, Patricia tree, and other tree structures.
• Syntactic neural network (S. M. Lucas: Rapid best-first retrieval from massive dictionaries, Pattern Recognition Letters 17, pp. 1507-1512, 1996).
• Commercial implementations: the Verity engine; most web search engines, with few exceptions, hide their key to success.

Dictionary representation by finite automata
Article: M. Mohri: On Some Applications of Finite-State Automata Theory to Natural Language Processing, see [MAR].
• Dictionary representation by a finite automaton.
• Ambiguities; unification of minimized deterministic automata.
• Example: done -> do.V3:PP, done -> done.A0.
• A morphological dictionary as a list of pairs [word form, lemma].
• Compact storage of the automaton data structure (Liang, 1983).
• Compression ratio up to 1:20 with the linear approach (relative to the length of the word).
• Transducers for dictionary representation.
• A deterministic transducer with 1 output (a subsequential transducer) represents a dictionary with one string on the output (information about morphology, hyphenation, ...).
• A deterministic transducer with p outputs (a p-subsequential transducer) represents a dictionary with several strings on the output (ambiguities).
• Determinization of a transducer is unrealizable in general (the class of deterministic transducers with output is a proper subclass of the nondeterministic ones); for the purposes of natural language processing, though, the problem usually does not occur (there are no cycles).
• A state for (w_1, w_2) can be added to a transducer without breaking the deterministic property: first a state for (w_1, ε), then the resulting final state with output w_2.
• This method is efficient and quick, but not minimal; there are minimization algorithms that lead to spatially economical solutions.
• Procedure: splitting of the dictionary, creation of deterministic transducers with p outputs, their minimization, then a deterministic unification of the transducers and minimization of the result.
• Further uses: efficient indexing, speech recognition, etc.

Coding
• Coding. Entropy, redundancy. Universal coding of the integers. Huffman coding. Adaptive Huffman coding.

Basic notions
Definition: An alphabet A is a finite nonempty set of symbols.
Definition: A word (string, message) over A is a sequence of symbols from A.
Definition: The empty string ε is the empty sequence of symbols. The set of nonempty words over A is denoted A+.
Definition: A code K is a triple (S, C, f), where S is a finite set of source units, C is a finite set of code units, and f: S -> C+ is an injective mapping. f can be extended to S+ -> C+: f(s_1 s_2 ... s_k) = f(s_1) f(s_2) ... f(s_k). C+ is sometimes also called the code.
Definition: x ∈ C+ is uniquely decodable with respect to f if there is at most one sequence y ∈ S+ such that f(y) = x.
Definition: A code K = (S, C, f) is uniquely decodable if all strings in C+ are uniquely decodable.
Definition: A code is a prefix code if no code word is a prefix of another code word.
Definition: A code is a suffix code if no code word is a suffix of another code word.
Definition: A code is an affix code if it is both a prefix and a suffix code.
Definition: A code is full if adding any further code word yields a code that is not uniquely decodable.
Definition: A block code of length n is a code in which all code words have length n.
Example: a block code is a prefix code, but not vice versa.
Definition: A code K = (S, C, f) is binary if |C| = 2.
Definition: Compression (encoding) maps the original data to the compressed data; decompression (decoding) maps them back.
Definition: The compression ratio is the ratio of the length of the compressed data to the length of the original data.
Example: Suggest a binary prefix code for the decimal digits, given that 3 and 4 occur often and 5 and 6 rarely.

Entropy
Let Y be a random variable with probability distribution p(y) = P(Y = y). Then the mathematical expectation (mean value) is E(Y) = Σ_{y∈Y} y p(y).
Let S = {x_1, x_2, ..., x_n} be a set of source units and let the probability of occurrence of the unit x_i in the information source S be p_i, i = 1, ..., n, n ∈ N.
Definition: The entropy of the information content of the unit x_i (the measure of the amount of information, or of uncertainty) is H(x_i) = H_i = -log2 p_i bits. A source unit with higher probability carries less information.
Definition: The entropy of the information source S is H(S) = -Σ_{i=1}^n p_i log2 p_i bits.
Indeed, H(S) = Σ_y p(y) log2(1/p(y)) = E(log2(1/p(Y))).
Definition: The entropy of a source message X = x_{i1} x_{i2} ... x_{ik} over the information source S is H(X, S) = H(X) = Σ_{j=1}^k H_{ij} = -Σ_{j=1}^k log2 p_{ij} bits.
Definition: The length l(X) of the encoded message X is l(X) = Σ_{j=1}^k |f(x_{ij})| = Σ_{j=1}^k d_{ij} bits.
Theorem: l(X) ≥ H(X, S).

Redundancy
For an axiomatic introduction of entropy see [MAR]; for the details of the derivation see ftp://www.math.muni.cz/pub/math/people/Paseka/lectures/kodovani.ps.
Definition: R(X) = l(X) - H(X) = Σ_{j=1}^k (d_{ij} + log2 p_{ij}) is the redundancy of the code K for the message X.
Definition: The average length of a code word of K is AL(K) = Σ_{i=1}^n p_i d_i bits.
Definition: The average information content (entropy) of the source S is AE(S) = Σ_{i=1}^n p_i H_i = -Σ_{i=1}^n p_i log2 p_i bits.
Definition: The average redundancy of the code K is AR(K) = AL(K) - AE(S) = Σ_{i=1}^n p_i (d_i + log2 p_i) bits.
Definition: A code is optimal if it has minimal redundancy.
Definition: A code is asymptotically optimal if, for a given probability distribution, the ratio AL(K)/AE(S) approaches 1 as the entropy approaches infinity.
Definition: A code K is universal if there are constants c_1, c_2 such that AL(K) ≤ c_1 AE(S) + c_2.
Theorem: A universal code is asymptotically optimal iff c_1 = 1.
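A quick numeric check of these definitions in Python; the source distribution and the prefix code are chosen for illustration:

from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # source unit probabilities
code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a binary prefix code

AE = -sum(pi * log2(pi) for pi in p.values())          # AE(S) = -sum p_i log2 p_i
AL = sum(p[s] * len(code[s]) for s in p)               # AL(K) = sum p_i d_i
AR = AL - AE                                           # AR(K) = AL(K) - AE(S)
print(AE, AL, AR)   # 1.75 1.75 0.0: this code is optimal for this source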
Fibonacci codes
Definition: The Fibonacci sequence of order m is F_n = F_{n-m} + F_{n-m+1} + ... + F_{n-1} for n ≥ 1.
Example: order 2: F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 3, F_4 = 5, F_5 = 8, ...
Example: order 3: F_{-2} = 0, F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 7, F_5 = 13, ...
Example: order 4: F_{-3} = 0, F_{-2} = 0, F_{-1} = 0, F_0 = 1, F_1 = 1, F_2 = 2, F_3 = 4, F_4 = 8, F_5 = 15, ...
Definition: The Fibonacci representation is R(N) = Σ_{i=1}^k d_i F_i, where d_i ∈ {0, 1} and d_k = 1.
Theorem: The Fibonacci representation is ambiguous in general; there is, however, one that has at most m - 1 consecutive ones in the sequence d_i.
Definition: The Fibonacci code of order m is FK_m(N) = d_1 d_2 ... d_k 1...1 (with m - 1 ones appended), where d_i are the coefficients from the previous theorem (the run of ones terminates a code word).
Example: 32 = 0·1 + 0·2 + 1·3 + 0·5 + 1·8 + 0·13 + 1·21, thus F(32) = 00101011.
Theorem: FK(2) is a prefix, universal code with c_1 = 2, c_2 = 3; hence it is not asymptotically optimal.

Elias codes of the integers
• unary code: α(N) = 0^{N-1} 1.
• binary code: β(1) = 1, β(2N + j) = β(N) j for j = 0, 1; β is not uniquely decodable (it is not a prefix code).
• ternary code: τ(N) = β(N)#.
• β'(1) = ε, β'(2N) = β'(N) 0, β'(2N + 1) = β'(N) 1; τ'(N) = β'(N)#.
• γ: every bit of β'(N) is inserted between a pair of bits of α(|β(N)|); example: γ(6) = 01001.
• C_γ = {γ(N) : N > 0} = (0{0,1})*1 is regular, and therefore decodable by a finite automaton.
• γ'(N) = α(|β(N)|) β'(N): the same length (a bit permutation of γ(N)), but more readable.
• C_γ' = {γ'(N) : N > 0} = {0^k 1 {0,1}^k : k ≥ 0} is not regular, and the decoder needs a counter.
• the recursive code ω: K := 0; while N > 1 do begin K := β(N) K; N := ⌊log2 N⌋ end.
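A sketch of the order-2 Fibonacci encoder in Python, reproducing the example F(32) = 00101011 above (a greedy Zeckendorf representation followed by the closing 1):

def fib_encode(n):
    fibs = [1, 2]                    # F_1, F_2, ... of order 2
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                       # keep only the Fibonacci numbers <= n
    bits = []
    for f in reversed(fibs):         # greedy choice: no two adjacent ones arise
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"   # d_1 d_2 ... d_k plus the closing 1

print(fib_encode(32))   # 00101011, as in the example above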
Data compression
• Encoding of information for communication purposes.
• Despite the tumultuous evolution of data storage capacities, there is still a lack of space, and access to compressed data can save time. Redundancy -> the construction of a minimally redundant code.
• Data model: structure (the set of units to compress plus the context of their occurrences) and parameters (the occurrence probabilities of the particular units). First the data model is created, then the actual encoding is performed.

History:
• 1838 Morse: coded e by frequency.
• 1949 Shannon, Fano, Weaver.
• 1952 Huffman; 5 bits per character.
• 1979 Ziv-Lempel; compress (Rodeh, Welch, Bell, Knuth, Miller, Wegman, Fiala, Greene, ...); 4 bits per character.
• eighties and nineties: PPM, DMC, gzip (zlib), SAKDC; 2-3 bits per character.
• at the turn of the millennium: bzip2; 2 bits per character.
(Figure: timeline of compression methods, 1950-2000.)

Models:
• models of order 0: probabilities of isolated source units (e.g., Morse, the character e);
• models with a finite context: Markov models, models of order n (e.g., Bach), P(a | x_1 x_2 ... x_n);
• models based on finite automata: synchronization and nonsynchronization strings; automata with a finite context; suitable for regular languages, unsuitable for context-free languages; P(a | q).

Statistical compression methods
• redundancy (non-uniform probability of occurrence of the source units);
• encoder, decoder, model;
• statistical modeling (the model does not depend on the concrete data);
• semiadaptive modeling (the model depends on the data; 2 passes; necessity of model transfer);
• adaptive modeling (only one pass; the model is built dynamically by both the encoder and the decoder).

Topics: Huffman coding. Adaptive Huffman coding. Arithmetic coding. Dictionary methods. Signature methods. Similarity of documents. Compression using neural networks.

Character techniques
• null suppression: replacement of repetitions (> 2) of the null character (255) by a special character SC;
• run-length encoding (RLE): SC X CC; a generalization to runs of any repeated character X, with CC the repetition count;
• MNP Class 5 RLE: C X X X, the count preceding three copies of the character; e.g., DDDDDBBAAAA -> 5DDDBB4AAA;
• half-byte packing (EBCDIC, ASCII), SI, SO;
• diatomic encoding: replacement of character pairs by a single character;
• Byte Pair Encoding, BPE (Gage, 1994);
• pattern substitution;
• Gilbert Held: Data and Image Compression.

Shannon-Fano coding
Input: a sequence of n source units S[i], 1 ≤ i ≤ n, in order of nondecreasing probabilities.
Output: n binary code words.

begin
  assign the empty string to all code words;
  SF-SPLIT(S)
end

procedure SF-SPLIT(S);
begin
  if |S| ≥ 2 then
  begin
    divide S into sequences S1 and S2 so that both sequences have roughly the same total probability;
    append 0 to all code words of S1;
    append 1 to all code words of S2;
    SF-SPLIT(S1); SF-SPLIT(S2)
  end
end

Properties:
• Shannon-Fano, 1949; a model of order 0.
• code words of length ⌈-log2 p_i⌉ or ⌈-log2 p_i⌉ + 1; hence AE ≤ AL ≤ AE + 1.
• code tree for (2, 2, 2, 2, 4, 4, 8).
• generally not optimal; two passes of the encoder through the text; static.

Huffman coding
• Huffman coding, 1952; static and dynamic variants.
• AL(K) = Σ_i p_i d_i.
• an optimal code (not the only possible one).
• O(n) assuming the source units are already sorted.
• a stable distribution -> the code can be prepared in advance.
• Example: (2, 2, 2, 2, 4, 4, 8).

Huffman coding: the sibling property
Definition: A binary tree has the sibling property if and only if (1) each node except the root has a sibling, and (2) the nodes can be arranged in a sequence of nondecreasing probability such that each node (except the root) is adjacent in the list to its sibling (the left sons are on the odd positions of the list, the right ones on the even positions).
Theorem: A binary prefix code is a Huffman code <=> it has the sibling property.

Huffman trees
• 2n - 1 nodes, max. 2^{n-1} possibilities.
• there are optimal binary prefix codes that are not Huffman codes.
• AR(K) < p_n + 0.086, where p_n is the maximum probability of a source unit.
• a Huffman code is full (hence poor error detection).
• it can be extended to an affix code: KWIC, left and right context, searching for *X*.
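For illustration, a compact static Huffman construction in Python over the example frequencies (2, 2, 2, 2, 4, 4, 8); ties are broken arbitrarily, so this produces one of several optimal codes:

import heapq
from itertools import count

def huffman(freqs):
    tick = count()                    # unique tie-breaker for equal weights
    heap = [(w, next(tick), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:              # repeatedly merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):           # assign 0/1 along the tree paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freqs = dict(zip("abcdefg", (2, 2, 2, 2, 4, 4, 8)))
print(huffman(freqs))   # AL(K) = sum p_i d_i is minimal for this source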
Adaptive Huffman coding
• FGK (Faller, Gallager, Knuth).
• suppression of the past by a forgetting coefficient (rounding of the occurrence counts).
• linear time of coding and decoding with respect to the length of the word.
• AL_HD ≤ 2 AL_HS; Vitter: AL_HD ≤ AL_HS + 1 (dynamic versus static Huffman coding).
• implementation details: tree representation, code tables.

Arithmetic coding
• a generalization of Huffman coding (the probabilities of the source units need not be negative powers of two).
• ordering of the source units; the cumulative probability cp_i = Σ_{j=1}^{i-1} p_j of the source unit x_i with probability p_i.
• Advantages: arbitrary proximity to the entropy; adaptivity is possible; speed.

Dictionary methods of data compression
Definition: A dictionary is a pair D = (M, C), where M is a finite set of words of the source language and C is a mapping of M to the set of code words.
Definition: L(m) denotes the length of the code word C(m) in bits, for m ∈ M.
Selection of source units:
• static (agreement on the dictionary in advance);
• semiadaptive (two passes through the text are necessary);
• adaptive.

Source units of length n: n-grams
Most often bigrams (n = 2).
• n fixed;
• n variable (by frequency of occurrence);
• adaptive (50% of an English text consists of about 150 most frequent words).
Disadvantages:
• inability to react to the probability distribution of the compressed data;
• a pre-prepared dictionary.

Semiadaptive dictionary methods: the dictionary creation procedure
(Transferred output: dictionary + compressed data, or compressed dictionary + compressed data.)
Advantage: suitable for extensive data, since the dictionary is only a small part of the data (corpora, CQP).
1. The frequency of N-grams is determined for N = 1, 2, ...
2. The dictionary is initialized by inserting the unigrams.
3. The N-grams with the highest frequency are gradually added to the dictionary. When a K-gram is inserted, the frequencies of its component (K-1)-grams, (K-2)-grams, ... are decreased. If the frequency of a component is thereby greatly reduced, the component is excluded from the dictionary.

Outline (week thirteen)
• Adaptive dictionary methods with dictionary restructuring.
• Syntactic methods.
• Checking of text correctness.
• Querying and TIS models.
• Vector model of documents.
• Automatic text structuring.
• Document similarity.

LZ77: sliding-window methods (LZ78, methods with an increasing dictionary, follow later)

a b c b a b b a | a b a c b
encoded part (the window, N ≤ 8192) | not yet encoded part (|B| ~ 10-20 characters)

In the encoded part, the longest prefix P of the not yet encoded part is searched for. If such a string S is found, P is encoded as (i, j, a), where i is the distance of the first character of S from the window border, j is the length of S, and a is the first character behind the prefix P. The window is then shifted j + 1 characters to the right. If no such string is found, the triple (0, 0, a) is produced, where a is the first character of the not yet encoded part.
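A didactic Python sketch of the (i, j, a) scheme just described. The window sizes are toy values, and the naive linear search stands in for the KMP-based prefix search mentioned below:

def lz77_encode(text, N=16, B=8):
    out, pos = [], 0
    while pos < len(text):
        start = max(0, pos - (N - B))            # left border of the window
        best_i = best_j = 0
        for i in range(start, pos):              # search the encoded part
            j = 0
            while (j < B - 1 and pos + j < len(text) - 1
                   and text[i + j] == text[pos + j]):
                j += 1                           # extend the match (may overlap pos)
            if j > best_j:
                best_i, best_j = pos - i, j      # distance from the border, length
        a = text[pos + best_j]                   # first character behind the prefix
        out.append((best_i, best_j, a))
        pos += best_j + 1                        # shift the window by j + 1
    return out

print(lz77_encode("abcbabbaabacb"))   # [(0, 0, 'a'), (0, 0, 'b'), ...]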
LZ77 (cont.)
• |M| = (N - B) × B × t, where t is the size of the alphabet.
• L(m) = ⌈log2(N - B)⌉ + ⌈log2 B⌉ + ⌈log2 t⌉.
• Advantage: the search for the longest prefix can use [KMP].

LZR
• LZR uses a tree containing all the prefixes of the already encoded part.
• The whole already encoded part is used as the dictionary.
• Because the i in (i, j, a) can be large, the Elias code for the coding of the integers is used.
• Disadvantage: the tree grows without any limitation, so after a defined memory size is exceeded, it is deleted and the construction starts from scratch.

LZSS (Storer, Szymanski)
• The code is a sequence of pointers and characters. A pointer (i, j) needs as much memory as p characters, so a pointer is used only when it pays off, and one extra bit is needed to distinguish a character from a pointer.
• The number of dictionary items is |M| = t + (N - B) × (B - p) (considering only substrings longer than p).
• The number of bits needed to encode m: L(m) = 1 + ⌈log2 t⌉ for m ∈ T; L(m) = 1 + ⌈log2 N⌉ + ⌈log2(B - p)⌉ otherwise (the length j of a substring can be represented as j - p).

LZB
• A pointer (i, j), in analogy to LZSS. If the window is not yet full (at the beginning) and the compressed text is shorter than N, the use of log2 N bits for encoding i is wasteful.
• LZB therefore phases in the binary coding: a prefix code with an increasing number of bits for increasing values. For j, the Elias code γ is used.

LZH
• LZSS in which the pointers are encoded by Huffman coding (i.e., by the distribution of their probabilities => two passes).

LZ78: methods with an increasing dictionary
The main idea: the dictionary contains phrases. A new phrase is created by extending an already existing phrase by one symbol. A phrase is encoded by the index of its prefix and by the added symbol.

Input:   a      b      ab     c      ba     bab    aa     aaa    aaaa
Index:   1      2      3      4      5      6      7      8      9
Output:  (0,a)  (0,b)  (1,b)  (0,c)  (2,a)  (5,b)  (1,a)  (7,a)  (8,a)
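A minimal LZ78 encoder in Python that reproduces the beginning of the example table above (index 0 stands for the empty prefix):

def lz78_encode(text):
    dic, out, w = {}, [], ""
    for ch in text:
        if w + ch in dic:
            w += ch                          # extend the current phrase
        else:
            out.append((dic.get(w, 0), ch))  # (index of the prefix, added symbol)
            dic[w + ch] = len(dic) + 1       # the new phrase gets the next index
            w = ""
    if w:
        out.append((dic[w], ""))             # flush a pending phrase at the end
    return out

print(lz78_encode("ababcbabab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (2, 'a'), (5, 'b')]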
LZFG (Fiala, Greene)
• The dictionary is stored in a tree structure; the edges are labeled with strings of characters. These strings are in the window, and each node of the tree contains a pointer into the window identifying the symbols on the path from the root to the node.

LZW
• Only the indexes are output.
• The dictionary is initialized with items for all the input symbols.
• The last symbol of each phrase is the first symbol of the following phrase.

Input:     a  b  ab  c  ba  bab  a   aa  aaa
New index:    4  5   6  7   8    9   10  11
Output:    1  2  4   3  5   8    1   10  11

LZC
• LZW as used by the program compress: on dictionary overflow, no further phrases are added and coding continues statically.
• Pointers are encoded with increasing length.
• Once the compression ratio starts to decrease, the dictionary is deleted and built again from scratch.

LZT
• As LZC, but when the dictionary overflows, the phrases least used in the recent past are excluded from the dictionary. It phases in the binary coding of the phrase indexes.

LZMW
• As LZT, but a new phrase is not created by appending one character to the previous phrase; instead, the new phrase is the concatenation of the two most recently encoded ones.

LZJ
• Another principle of dictionary construction: at the beginning only the single symbols are inserted; the dictionary is stored in a tree and contains all the substrings of the processed string up to length h.
• A full dictionary => either a static procedure, or omitting of the nodes with a low usage frequency.

Dictionary restructuring
• Ongoing reorganization of the source units -> shorter code strings.
• Variants of heuristics: count of occurrences; move to the front (BSTW); exchange with the previous item; move X positions forward.
• BSTW advantage: high locality of occurrences of a small number of source units.
• Example: I'm not going to the forest, ...; 1^n 2^n ... k^n.
• Generalization: the recency coefficient; interval coding.

Interval coding: a word is represented by the number of words since its last occurrence. The dictionary contains the words a_1, a_2, ..., a_n; the input sequence is x_1, x_2, ..., x_m. The value LAST(a_i), holding the position of the last occurrence, is initialized to zero.

for t := 1 to m do
begin {x_t = a_i}
  if LAST(x_t) = 0
    then y(t) := t + i - 1
    else y(t) := t - LAST(x_t);
  LAST(x_t) := t
end

The sequence y_1, y_2, ..., y_m is the output of the encoder and can be encoded by a single code of variable length.

Syntactic methods
• The grammar of the message language is known.
• The left parse of the derivation tree of the string.
• Global numbering of the rules.
• Local numbering of the rules.
• The decision states of an LR analyzer are encoded.

Context modeling
• fixed context: a model of order N;
• combined approach: contexts of various lengths, with fixed or variable weights;
• time and memory consuming;
• assignment of a probability to a new source unit: the escape probability e;
• automata with a finite context;
• dynamic Markov modeling.

Checking the correctness of the text
• Checking of the text using a frequency dictionary.
• Checking of the text using a double frequency dictionary.
• Interactive checking of the text (ispell).
• Checking of the text based on the regularity of words: the weirdness coefficient.

The weirdness coefficient of the trigram xyz is
KPT = [log(f(xy) - 1) + log(f(yz) - 1)]/2 - log(f(xyz) - 1),
where f(xy) and f(xyz) are the relative frequencies of the bigram and the trigram, respectively, and log(0) is defined as -10.
The weirdness coefficient of a word is
KPS = sqrt((1/n) Σ_{i=1}^n (KPT_i - ØKPT)^2),
where KPT_i is the weirdness coefficient of the i-th trigram and ØKPT is the mean of the weirdness coefficients of all the trigrams contained in the word.
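A direct transcription of the two coefficients into Python; base-10 logarithms are assumed here, and the frequency values are invented placeholders:

from math import log10, sqrt

def safe_log(x):                 # log(0) is defined as -10
    return log10(x) if x > 0 else -10.0

def kpt(f_xy, f_yz, f_xyz):      # weirdness coefficient of one trigram xyz
    return (safe_log(f_xy - 1) + safe_log(f_yz - 1)) / 2 - safe_log(f_xyz - 1)

def kps(kpts):                   # weirdness of a word from its trigram coefficients
    mean = sum(kpts) / len(kpts)
    return sqrt(sum((k - mean) ** 2 for k in kpts) / len(kpts))

trigram_kpts = [kpt(120, 80, 15), kpt(40, 60, 2), kpt(5, 3, 1)]
print(trigram_kpts, kps(trigram_kpts))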
Outline (week fourteen)
• Querying and TIS models. Boolean model of documents. Vector model of documents. TIS architecture. Signature methods. Similarity of documents.
• Vector model of documents (completion). Extended boolean model. Probabilistic model. Model of document clusters. TIS architecture. Automatic text structuring. Similarity of documents. Lexicon storage. Signature methods.

Querying
Different methods of hierarchization and document storage give different possibilities and different efficiency of querying:
• Boolean model, SQL.
• Vector model.
• Extended boolean types.
• Probabilistic model.
• Model of document clusters.

Semantic querying
The search lies in reducing the uncertainty of a question:
1. We find a document with high relevance.
2. We start to query with its key words.
3. We remove descriptors, or replace them with disjunctions.
The system http://infomap.stanford.edu works with the searched meaning/concept (as opposed to mere strings of characters). The right query formulation is half of the answer. The search lies in the determination of the semantically closest terms.

Boolean model
• Fifties: representation of documents by sets of terms, and querying based on the evaluation of boolean expressions.
• A query expression is built inductively from the primitives: a term; attribute_name = attribute_value (comparison); function_name(term) (application of a function); and from these using parentheses and the logical connectives X and Y, X or Y, X xor Y, not Y.
• disjunctive normal form, conjunctive normal form;
• proximity operators;
• regular expressions;
• thesaurus usage;
• boolean operators and, or, xor, not;
• positional operators adj, (n) words, with, same, syn;
• SQL extension: operations/queries using a thesaurus: BT(A) broader term, NT(A) narrower term, PT(A) preferred term, SYN(A) synonyms of the term A, RT(A) related term, TT(A) top term.

ORACLE SQL*TEXTRETRIEVAL
SELECT specification_of_items FROM specification_of_tables WHERE item CONTAINS text_expression
Example:
SELECT TITLE FROM BOOK WHERE ABSTRACT CONTAINS 'TEXT' AND RT(RETRIEVAL)
Text expressions: 'string', 'string'*, *'string', 'st?ing', 'str%ing', 'stringa' (m,n) 'stringb', 'multiword phrases', BT('string',n), BT('string',*), NT('string',n).
Example:
SELECT NAME FROM EMPLOYEE WHERE EDUCATION CONTAINS RT(UNIVERSITA) AND LANGUAGES CONTAINS 'ENGLISH' AND 'GERMAN' AND PUBLICATIONS CONTAINS 'BOOK' OR NT('BOOK',*)

The association coefficient between queries:
asoc(Q_A, Q_B) = log( ((fN - AB - N/2)^2 N) / (AB(N - A)(N - B)) ),
where A is the number of documents hit by the query Q_A, B is the number of documents hit by the query Q_B (whose relevance we compute), f is the number of documents hit by both queries, and N is the total number of documents in the TIS.
• cutoff (relevant/irrelevant);
• clustering/nesting;
• 1st generation, 2nd generation, ...

Vector model of documents
Let a_1, ..., a_n be terms, D_1, ..., D_m documents, and W = (w_ij) a relevance matrix of type m × n, w_ij ∈ ⟨0, 1⟩ (0 irrelevant, 1 relevant). A query is Q = (q_1, ..., q_n).
• S(Q, D_j) = Σ_i q_i w_ij is the similarity coefficient.
• head(sort(S(Q, D_j))) is the answer.
CONS: it does not take "and" and "or" into account.
PROS: possible improvements:
• normalization of the weights;
• term frequency TF; inverted document frequency IDF = log2(m/k), where k is the number of documents containing the term;
• discrimination value of the terms;
• normalization of the weights for a document and for a query [POK, pages 85-113];
• w_ij = (TF_ij / max TF) · log2(m/k).

Automatic text structuring
• Interrelations between the documents in a TIS.
• Encyclopedias (OSN, Funk and Wagnalls New Encyclopedia).
• [SBA] http://columbus.cs.nott.ac.uk/compsci/epo/epodd/ep056gs
• Google/CiteSeer: "automatic structuring of text files".

Document similarity
• Most often the cosine measure, for good reasons. For a detailed overview of similarity functions see chapter 5.7 of [KOR].
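A small end-to-end sketch in Python: the TF-IDF weighting w_ij above combined with the cosine measure for ranking. The documents and all names are toy examples:

from math import log2, sqrt

docs = ["text retrieval system", "vector model of documents",
        "boolean retrieval model"]
terms = sorted({t for d in docs for t in d.split()})
m = len(docs)

def weight(term, doc):                 # w_ij = (TF_ij / max TF) * log2(m/k)
    words = doc.split()
    tf = words.count(term)
    if tf == 0:
        return 0.0
    max_tf = max(words.count(t) for t in words)
    k = sum(term in d.split() for d in docs)   # number of documents with the term
    return (tf / max_tf) * log2(m / k)

W = [[weight(t, d) for t in terms] for d in docs]   # the relevance matrix W

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = sqrt(sum(qi * qi for qi in q)) * sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

query = [1.0 if t in ("retrieval", "model") else 0.0 for t in terms]
print(sorted(((cosine(query, W[j]), j) for j in range(m)), reverse=True))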