Seminar 1 Exercise 1/1 Recommend a query processing strategy for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) with respect to the following postings list sizes: eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812 We use a database trick where we filter out the results with the clause of the shortest intermediate result first. Operations OR is understood as addition and AND as multiplication. Compose the equations: tangerine OR trees = 46653 + 316812 = 363465 marmalade OR skies = 107913 + 271658 = 379571 kaleidoscope OR eyes = 87009 + 213312 = 300321 After sorting these with respect to sizes and we get the ordering kaleidoscope OR eyes < tangerine OR trees < marmalade OR skies we see that the query is best processed in the following sequence: 1. π‘Ž = kaleidoscope OR eyes 2. 𝑏 = tangerine OR trees 3. 𝑐 = marmalade OR skies 4. 𝑑 = π‘Ž AND 𝑏 5. 𝑒 = 𝑑 AND 𝑐 Exercise 1/2 What is the best order for processing the query ostrich AND hippo AND giraffe if we know that the number of occurrences of the animals are 100, 500, 300, respectively? (ostrich AND giraffe) AND hippo 1 Exercise 1/3 Create an inverted index composed of the following collection of documents: Doc 1: new home sales top forecasts Doc 2: home sales rise in July Doc 3: increase in home sales in July Doc 4: July new home sales rise Very easy procedure. Start with an empty table. If the term already appears in the table as a key, add the document ID only. Otherwise, take each term of a document and add it as a key to the table with the ID of the document. This way we get the inverted index represented in the following table. new 1 4 home 1 2 3 4 sales 1 2 3 4 top 1 forecasts 1 rise 2 4 in 2 3 July 2 3 4 increase 3 Table 1: Inverted index Exercise 1/4 Create an inverted index composed of the following collection of documents: Doc 1: hippo ostrich ostrich giraffe Doc 2: lion frog giraffe hippo Doc 3: ostrich frog bat giraffe lion frog hippo 1 2 ostrich 1 3 giraffe 1 2 3 lion 2 3 frog 2 3 bat 3 Table 2: Inverted index 2 Seminar 2 Algorithm 1 (Soundex Code) Transformation of a string to a 4-character soundex code 1. Keep the first character 2. Rewrite {𝐴, 𝐸, 𝐼, 𝑂, π‘ˆ, 𝐻, π‘Š, π‘Œ } to 0 3. Rewrite characters (a) {𝐡, 𝐹, 𝑃, 𝑉 } to 1 (b) {𝐢, 𝐺, 𝐽, 𝐾, 𝑄, 𝑆, 𝑋, 𝑍} to 2 (c) {𝐷, 𝑇} to 3 (d) {𝐿} to 4 (e) {𝑀, 𝑁} to 5 (f) {𝑅} to 6 4. Remove duplicities 5. Remove zeros 6. Change to length 4 (truncate or add zeros) Algorithm 2 (Querying in Permuterm Index) For query π‘ž, find keys according to the following scheme: βˆ™ for π‘ž = 𝑋, find keys in the form 𝑋$ βˆ™ for π‘ž = 𝑋* , find keys in the form $𝑋* βˆ™ for π‘ž = * 𝑋, find keys in the form 𝑋$* βˆ™ for π‘ž = * 𝑋* , find keys in the form 𝑋* βˆ™ for π‘ž = 𝑋* π‘Œ , find keys in the form π‘Œ $𝑋* Exercise 2/1 Below is a part of index with positions in the form doc1: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, π‘π‘œπ‘ 3, . . .⟩; doc2: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, . . .⟩; . . . βˆ™ angels: 2 : ⟨36, 174, 252, 651⟩; 4 : ⟨12, 22, 102, 432⟩; 7 : ⟨17⟩; βˆ™ fools: 2 : ⟨1, 17, 74, 222⟩; 4 : ⟨8, 78, 108, 458⟩; 7 : ⟨3, 13, 23, 193⟩; βˆ™ fear: 2 : ⟨87, 704, 722, 901⟩; 4 : ⟨13, 43, 113, 433⟩; 7 : ⟨18, 328, 528⟩; βˆ™ in: 2 : ⟨3, 37, 76, 444, 851⟩; 4 : ⟨10, 20, 110, 470, 500⟩; 7 : ⟨5, 15, 25, 195⟩; βˆ™ rush: 2 : ⟨2, 66, 194, 321, 702⟩; 4 : ⟨9, 69, 149, 429, 569⟩; 7 : ⟨4, 14, 404⟩; βˆ™ to: 2 : ⟨47, 86, 234, 999⟩; 4 : ⟨14, 24, 774, 944⟩; 7 : ⟨19, 319, 599, 709⟩; 3 βˆ™ tread: 2 : ⟨57, 94, 333⟩; 4 : ⟨15, 35, 155⟩; 7 : ⟨20, 320⟩; βˆ™ where: 2 : ⟨67, 124, 393, 1001⟩; 4 : ⟨11, 41, 101, 421, 431⟩; 7 : ⟨15, 35, 735⟩; The following terms are phrase queries. Which documents correspond to the following queries and on which positions? a) fools rush in b) fools rush in AND angels fear to tread. The index is incorrect. How? In order to retrieve the query it is necessary that the words are in a sequence. That is, if the word angels is in doc2 on position 36, then the word fear has to be in the same document on the position 37 and so on. For the exercise a) we calculate all possible positions of the phrase. Word fools appears in doc2 on positions ⟨1, 17, 74, 222⟩. That means that the word rush has to appear on positions ⟨2, 18, 75, 223⟩ and the word in on positions ⟨3, 19, 76, 224⟩. Similar process is applied on doc4 and doc7 which retrieves the requested results. doc2 ⟨1, 2, 3⟩, ⟨17, 18, 19⟩, ⟨74, 75, 76⟩, ⟨222, 223, 224⟩ doc4 ⟨8, 9, 10⟩, ⟨78, 79, 80⟩, ⟨108, 109, 110⟩, ⟨458, 459, 460⟩ doc7 ⟨3, 4, 5⟩, ⟨13, 14, 15⟩, ⟨23, 24, 25⟩, ⟨193, 194, 195⟩ Now we look at the original position index and search for whether there is a conjunction between requested and real positions. Take doc2 and check whether the words fools, rush and in are in a sequence on positions ⟨1, 2, 3⟩. Since yes, the system returns doc2 as relevant to our query. Same analogy is used for the remaining documents for which we get the result doc2: {⟨1, 2, 3⟩}; doc4: {⟨8, 9, 10⟩}; and doc7: {⟨3, 4, 5⟩, ⟨13, 14, 15⟩}. For the exercise b) we find the requested positions for also the term angels fear to tread. doc2 ⟨36, 37, 38, 39⟩, ⟨174, 175, 176, 177⟩, ⟨252, 253, 254, 255⟩, ⟨651, 652, 653, 654⟩ doc4 ⟨12, 13, 14, 15⟩, ⟨22, 23, 24, 25⟩, ⟨102, 103, 104, 105⟩, ⟨432, 433, 434, 435⟩ doc7 ⟨17, 18, 19, 20⟩ They appear in the correct order in doc4: {⟨12, 13, 14, 15⟩} and in doc7: {⟨17, 18, 19, 20⟩}. Taking the first part from a), we only check whether the results overlap {doc2(1), doc4(8), doc7(3), doc7(13)} ∩ {doc4(12), doc7(17)} = {doc4(8,12), doc7(3,17), doc7(13,17)}. The index is incorrect. We need to have a look into doc7, where on position 15 are two terms in and where. Exercise 2/2 Below is a part of index with positions in the form doc1: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, π‘π‘œπ‘ 3, . . .⟩; doc2: βŸ¨π‘π‘œπ‘ 1, π‘π‘œπ‘ 2, . . .⟩; . . . βˆ™ ostrich: 1 : <1,7>; 2 : <4,5>; 4 βˆ™ hippo: 1 : <5,8,9>; 3 : <6,9>; βˆ™ lion: 1 : <3,6>; 2 : <3,7>; βˆ™ giraffe: 1 : <2,4>; 2 : <1,2,8>; Which documents correspond to the phrase query lion giraffe hippo and on which positions? Include intermediate results. Candidates: doc1 ⟨3, 4, 5⟩, ⟨6, 7, 8⟩ doc2 ⟨3, 4, 5⟩, ⟨7, 8, 9⟩ Result: doc1 ⟨3, 4, 5⟩ Exercise 2/3 Consider a query composed of two terms. Non-positional postings list of one term is composed of 16 items 𝑃 = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 215, 300, 500] and the second term has the postings list of only a single element 𝑅 = [47]. Find out how many comparisons (and why) are necessary to find out the intersection of the lists that are organized as follows: a) standard postings lists b) postings lists with skip pointers of skip frequency βˆšοΈ€ |𝑃| a) A naive algorithm compares each element from 𝑃 with each element from 𝑅, which is 11. b) With skip pointers of frequency βˆšοΈ€ |𝑃| = 4 we reduce the number of comparisons. Instead of the next element 𝑖 + 1 only, the pointer goes directly to 𝑖 + βˆšοΈ€ |𝑃| until the referred value is larger than the searched value, after which it jumps back and then step forward by 1. Starting at position 0, the algorithm proceeds as follows: compare 4:47, jump to 14, compare 14:47, jump to 22, compare 22:47, jump to 120, compare 120:47, jump back to 22, step to 32, compare 32:47, step to 47, compare 47:47, done. The number of compare operations is 6. Exercise 2/4 Consider a query composed of two terms. Non-positional postings list with skip pointers of one term is composed of 16 items 𝑃1 = [4, 6, 10, 12, 14, 16, 18, 20, 22, 32, 47, 81, 120, 215, 300, 500] with skip frequency of square root of its length and the second term has the standard postings list 𝑃2 = [18, 32, 60]. How many comparisons are necessary to find out the intersection of the lists? 5 (4, 18), (14, 18), (22, 18), (16, 18), (18, 18), (20, 32), (22, 32), (120, 32), (32, 32), (47, 60), (81, 60). Exercise 2/5 List the comparisons performed to intersect the following sorted non-positional postings lists with skip pointers of frequency 5. 𝑃1 = [2, 10, 12, 16] and 𝑃2 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] (2, 1), (2, 7), (2, 3), (10, 3), (10, 4), (10, 5), (10, 6), (10, 7), (10, 12), (10, 8), (10, 9), (10, 10), (12, 11), (12, 12), (16, 13), (16, 14), (16, 15). Exercise 2/6 List the comparisons performed to intersect the following sorted non-positional postings lists with skip pointers of frequency 5. 𝑃1 = [4, 5, 6, 7, 8, 9, 10, 13, 14, 15] and 𝑃2 = [1, 2, 3, 4, 5, 10, 11, 15, 16] (4, 1), (4, 10), (4, 2), (4, 3), (4, 4), (5, 5), (6, 10), (7, 10), (8, 10), (9, 10), (10, 10), (13, 11), (13, 15), (14, 15), (15, 15). Exercise 2/7 a) Find two different words of the same soundex code. b) Find two phonetically similar words of different soundex codes. a) sword and short have codes S630 b) fog and thug have codes F200 and T200 Exercise 2/8 Write elements in a dictionary of the permuterm index generated by the term mama. π‘šπ‘Žπ‘šπ‘Ž$, π‘Žπ‘šπ‘Ž$π‘š, π‘šπ‘Ž$π‘šπ‘Ž, π‘Ž$π‘šπ‘Žπ‘š, $π‘šπ‘Žπ‘šπ‘Ž. 6 Exercise 2/9 Which keys are usable for finding the term s*ng in a permuterm wildcard index? 𝑛𝑔$𝑠* Exercise 2/10 What is the complexity of intersection of two un-ordered posting lists of lengths π‘š and 𝑛? π’ͺ(π‘š log π‘š + 𝑛 log 𝑛) Exercise 2/11 What is the complexity (in π’ͺ-notation) of intersecting of two ordered posting lists of lengths π‘š and 𝑛? π’ͺ(π‘š + 𝑛) Exercise 2/12 What is the worst-case complexity of searching in hash tables? Linear. Seminar 3 Algorithm 3 (Variable byte code) A number 𝑛 is encoded in variable byte code in the following procedure: 1. Take a binary representation of 𝑛 with padding to the length of a multiple of 7. 2. Split into of 7 bit blocks right-to-left. 3. Add 1 to the beginning of the last block and 0 to the beginning of all previous blocks. Example: 𝑉 𝐡(824) = 00000110 10111000 Definition 1 (Unary code) Unary code, also referred to as 𝛼 code, is a coding type where a number 𝑛 is represented by a sequence of 𝑛 1s (or 0s) and terminated with one 0 (or 1). That is, 6 in unary code is 1111110 (or 0000001). The alternative representation in parentheses is equivalent but for this course we use the default representation. 7 Definition 2 (𝛾 code) 𝛾 code is a coding type, that consists of an offset and its length: 𝛾(𝑛) = length of offset(𝑛) in 𝛼,offset(𝑛). Offset is a binary representation of a number 𝑛 without the highest bit (1). Length of this offset encoded in the unary (𝛼) code. Then the number 60 is encoded in 𝛾 as 111110,11100. Definition 3 (𝛿 code) A number 𝑛 is encoded in 𝛿 code in the following way. First calculate the offset of 𝑛 and the length of 𝑛 encode with 𝛾 code. Then add the offset of 𝑛. The final form is 𝛿(𝑛) = length of offset(𝑛) in 𝛾,offset(𝑛). Analogously, 600 is encoded in 𝛿 as 1110,001,001011000. Definition 4 (Zipf’s law) Zipf’s law says that the 𝑖-th most frequent term has the frequency 1 𝑖 . In this exercise we use the dependence of the Zipf’s law 𝑐𝑓𝑖 ∝ 1 𝑖 = 𝑐𝑖 π‘˜ where 𝑐𝑓𝑖 is the number of terms 𝑑𝑖 in a given collection with π‘˜ = βˆ’1. Definition 5 (Heaps’ law) Heaps’ law expresses an empiric dependency of collection size (number of all words) 𝑇 and vocabulary size (number of distinct words) 𝑀 by 𝑀 = π‘˜π‘‡ 𝑏 where 30 ≀ π‘˜ ≀ 100 and 𝑏 β‰ˆ 1 2 . Exercise 3/1 Count variable byte code for the postings list ⟨777, 17 743, 294 068, 31 251 336⟩. Bear in mind that the gaps are encoded. Write in 8-bit blocks. Encode the list of gaps ⟨777, 16 966, 276 325, 30 957 268⟩. Variable byte code of the gaps: βˆ™ 𝑉 𝐡(777) = 00000110 10001001 βˆ™ 𝑉 𝐡(16 966) = 00000001 00000100 11000110 βˆ™ 𝑉 𝐡(276 325) = 00010000 01101110 11100101 βˆ™ 𝑉 𝐡(30 957 268) = 00001110 01100001 00111101 11010100 Result: 𝑉 𝐡(⟨777, 17 743, 294 068, 31 251 336⟩) = 00000110 10001001 00000001 00000100 11000110 00010000 01101110 11100101 00001110 01100001 00111101 11010100 Exercise 3/2 Count 𝛾 and 𝛿 codes for the numbers 63 and 1023. According to the definition 2 it is necessary to count the offsets as binary representations without the highest bit 6310 = 1111112 and offset(63) = 11111. Offset length is encoded in 𝛼 as |11111| = 5 𝛼(5) = 111110. Finally, 𝛾(63) = 111110, 11111. Analogically for 1023. 102310 = 11111111112, offset is 111111111, its length is |111111111| = 9 𝛼(9) = 1111111110. Then 𝛾(1023) = 1111111110, 111111111. 𝛿 is a little more complicated. First we count the offset 63 = 11111 and its length |11111| = 5. The value of 5 we encode in 𝛾 so 𝛾(5) = 110, 01. By definition 3 we have 𝛿(63) = 110, 01, 11111. And finally, 𝛿(1023) = 1110, 010, 111111111. 8 Exercise 3/3 Calculate the variable byte code, 𝛾 code and 𝛿 code of the postings list 𝑃 = [32, 160, 162]. Note that gaps are encoded. Include intermediate results (offsets, lengths). offset 32 = 0000 and 𝛼(|0000|) = 11110 𝛾(32) = 11110, 0000 offset 128 = 0000000 and 𝛼(|0000000|) = 11111110 𝛾(128) = 11111110, 0000000 offset 2 = 0 and 𝛼(|0|) = 10 𝛾(2) = 10, 0 𝛾(𝑃) = 111100000111111100000000100 offset 32 = 0000 and 𝛾(|0000|) = 110, 00 𝛿(32) = 110, 00, 0000 offset 128 = 0000000 and 𝛾(|0000000|) = 110, 11 𝛿(128) = 110, 11, 0000000 offset 2 = 0 and 𝛾(|0|) = 0 𝛿(2) = 0, , 0 𝛿(𝑃) = 11000000011011000000000 Exercise 3/4 Consider a posting list with the following list of gaps 𝐺 = [4, 6, 1, 2048, 64, 248, 2, 130]. Using variable byte encoding, βˆ™ What is the largest gap you can encode in 1 byte? βˆ™ What is the largest gap you can encode in 2 bytes? βˆ™ How many bytes will the above gaps list require under this encoding? βˆ™ 64 βˆ™ 2048 βˆ™ 11 Exercise 3/5 From the following sequence of 𝛾-encoded gaps, reconstruct first the gaps list and then the original postings list. Recall that the 𝛼 code encodes a number 𝑛 with 𝑛 1s followed by one 0. 1110001110101011111101101111011 [1110001, 11010, 101, 11111011011, 11011] [1001, 110, 11, 111011, 111] [9, 6, 3, 59, 7] [9, 15, 18, 77, 84] 9 Exercise 3/6 What does the Zipf’s law say? Answers can vary. For official definition refer to the Manning book. Exercise 3/7 What does the Heaps’ law say? Answers can vary. For official definition refer to the Manning book. Exercise 3/8 A collection of documents contains 4 words: one, two, three, four of decreasing word frequencies 𝑓1, 𝑓2, 𝑓3 and 𝑓4. The total number of tokens in the collection is 5000. Assume that the Zipf’s law holds for this collection perfectly. What are the word frequencies? We use the Zipf’s law in Definition 4. The least frequent term is four, then three, two and the most frequent is one. Applying the Zipf’s law we get 𝑐𝑓1 + 𝑐𝑓2 + 𝑐𝑓3 + 𝑐𝑓4 = 5000 𝑐 Β· 1βˆ’1 + 𝑐 Β· 2βˆ’1 + 𝑐 Β· 3βˆ’1 + 𝑐 Β· 4βˆ’1 = 5000 𝑐 + 1 2 𝑐 + 1 3 𝑐 + 1 4 𝑐 = 5000 12𝑐 + 6𝑐 + 4𝑐 + 3𝑐 = 60000 25𝑐 = 60000 𝑐 = 2400 and, plugging in to the formula 𝑐𝑓𝑖 = π‘π‘–βˆ’1 , we obtain the term frequency values: 𝑐𝑓1 = 24001 1 = 2400 𝑐𝑓2 = 24001 2 = 1200 𝑐𝑓3 = 24001 3 = 800 𝑐𝑓4 = 24001 4 = 600 Exercise 3/9 How many distinct terms are expected in a document of 1,000,000 tokens? Use the Heaps’ law with parameters π‘˜ = 44 and 𝑏 = 0.5 By Definition 5, 44 Γ— 1,000,0000.5 = 44,000. 10 Seminar 4 Definition 6 (Term weight) Weight of a term 𝑑 in a document 𝑑 is counted as 𝑀 𝑑,𝑑 = {οΈ‚ 1 + log (οΈ€ tf 𝑑,𝑑 )οΈ€ if 𝑛 > 0 0 otherwise where tf 𝑑,𝑑 is the number of terms 𝑑 in a document 𝑑. Definition 7 (Inverse document frequency) Inverse document frequency of a term 𝑑 is defined as idf 𝑑 = log (οΈ‚ 𝑁 df 𝑑 )οΈ‚ where 𝑁 is the number of all documents and df 𝑑 (document frequency) is the number of documents that contain 𝑑. Definition 8 (tf-idf weighting scheme) In the tf-idf weighting scheme, a term 𝑑 in a document 𝑑 has weight tf-idf 𝑑,𝑑 = tf 𝑑,𝑑 Β· idf 𝑑 Definition 9 (Cosine (Euclidean) normalization) A vector 𝑣 is cosine-normalized by 𝑣 𝑗 = 𝑣 𝑗 ||𝑣|| = 𝑣 𝑗 √︁ βˆ‘οΈ€|𝑣| π‘˜=1 𝑣 π‘˜ 2 where 𝑣 𝑗 be the number on the 𝑗-th position in 𝑣. Exercise 4/1 Consider the frequency table of the words of three documents π‘‘π‘œπ‘1, π‘‘π‘œπ‘2, π‘‘π‘œπ‘3 below. Calculate the tf-idf weight of the terms car, auto, insurance, best for each document. idf values of terms are in the table. π‘‘π‘œπ‘1 π‘‘π‘œπ‘2 π‘‘π‘œπ‘3 idf car 27 4 24 1.65 auto 3 33 0 2.08 insurance 0 33 29 1.62 best 14 0 17 1.5 Table 3: Exercise. After counting tf-idf weights by Definition 8 individually for each term we get the following table 11 tf-idf π‘‘π‘œπ‘1 π‘‘π‘œπ‘2 π‘‘π‘œπ‘3 car 44.55 6.6 39.6 auto 6.24 68.64 0 insurance 0 53.46 46.98 best 21 0 25.5 Table 4: Solution. Exercise 4/2 Count document representations as normalized Euclidean weight vectors for each document from the previous exercise. Each vector has four components, one for each term. Normalized Euclidean weight vectors are counted by Definition 9. Denominators π‘š π‘‘π‘œπ‘ 𝑛 for the individual documents are π‘š π‘‘π‘œπ‘1 = βˆšοΈ€ 44.552 + 6.242 + 212 = 49.6451 π‘š π‘‘π‘œπ‘2 = βˆšοΈ€ 6.62 + 68.642 + 53.462 = 87.2524 π‘š π‘‘π‘œπ‘3 = βˆšοΈ€ 39.62 + 46.982 + 25.52 = 66.5247 and the document representations are 𝑑1 = (οΈ‚ 44.55 49.6451 ; 6.24 49.6451 ; 0 49.6451 ; 21 49.6451 )οΈ‚ = (0.8974; 0.1257; 0; 0.423) 𝑑2 = (οΈ‚ 6.6 87.2524 ; 68.64 87.2524 ; 53.46 87.2524 ; 0 87.2524 )οΈ‚ = (0.0756; 0.7876; 0.6127; 0) 𝑑3 = (οΈ‚ 39.6 66.5247 ; 0 66.5247 ; 46.98 66.5247 ; 25.5 66.5247 )οΈ‚ = (0.5953; 0; 0.7062; 0.3833) Exercise 4/3 Based on the weights from the last exercise, compute the relevance scores of the three documents for the query car insurance. Use each of the two weighting schemes: a) Term weight is 1 if the query contains the word and 0 otherwise. b) Euclidean normalized tf-idf. Please note that a document and a representation of this document are different things. Document is always fixed but the representations may vary under different settings and conditions. In this exercise we fix document representations from the last exercises and will count relevance scores for query and documents under two different representations of the query. It might be helpful to view on a query as on another document, as it is a sequence of words. 12 We count the relevance scores for a) as the scalar products of the representation of the query π‘ž = (1, 0, 1, 0) with representations of the documents 𝑑 𝑛 from the last exercise: π‘ž Β· 𝑑1 = 1 Β· 0.8974 + 0 Β· 0.1257 + 1 Β· 0 + 0 Β· 0.423 = 0.8974 π‘ž Β· 𝑑2 = 1 Β· 0.0756 + 0 Β· 0.7876 + 1 Β· 0.6127 + 0 Β· 0 = 0.6883 π‘ž Β· 𝑑3 = 1 Β· 0.5953 + 0 Β· 0 + 1 Β· 0.7062 + 0 Β· 0.3833 = 1.3015 For b) we first need the normalized tf-idf vector π‘ž, which is obtained by dividing each component of the query by the length of idf vector √ 1.652 + 02 + 1.622 + 02 = 2.3123 tf idf tf-idf π‘ž car 1 1.65 1.65 0.7136 auto 0 2.08 0 0 insurance 1 1.62 1.62 0.7006 best 0 1.5 0 0 Table 5: Process of finding the Euclidean normalized tf-idf. Now we multiply π‘ž with the document vectors and we obtain the relevance scores: π‘ž Β· 𝑑1 = 0.7136 Β· 0.8974 + 0 Β· 0.1257 + 0.7006 Β· 0 + 0 Β· 0.423 = 0.6404 π‘ž Β· 𝑑2 = 0.7136 Β· 0.0756 + 0 Β· 0.7876 + 0.7006 Β· 0.6127 + 0 Β· 0 = 0.4832 π‘ž Β· 𝑑3 = 0.7136 Β· 0.5953 + 0 Β· 0 + 0.7006 Β· 0.7062 + 0 Β· 0.3833 = 0.9196 Exercise 4/4 Consider a collection of documents and the terms dog, cat and food that occur in 10βˆ’3π‘₯ , 10βˆ’2π‘₯ and 10βˆ’π‘₯ of the documents, respectively. Now document doc1 contains the words 2𝑦, 𝑦 and 3𝑦 times and doc2 2𝑧, 3𝑧 and 𝑧 times. Order these two documents based on vector space similarity with the query dog food. Intuitively, π‘‘π‘œπ‘1 is more relevant than π‘‘π‘œπ‘2 because π‘‘π‘œπ‘2 is relatively too much about cats and too little about food, which is a satisfactory answer. But precisely: π‘‘π‘œπ‘1 π‘‘π‘œπ‘2 π‘ž dog 2𝑦 Β· 3π‘₯ 2𝑧 Β· 3π‘₯ 3π‘₯ cat 𝑦 Β· 2π‘₯ 3𝑧 Β· 2π‘₯ 0 food 3𝑦 Β· π‘₯ 𝑧 Β· π‘₯ π‘₯ Table 6: tf-idf. π‘‘π‘œπ‘1 π‘‘π‘œπ‘2 π‘ž dog 6π‘₯𝑦/7π‘₯𝑦 = 6/7 6π‘₯𝑧/8.5π‘₯𝑧 = 12/17 3π‘₯/3.2π‘₯ = 15/16 cat 2π‘₯𝑦/7π‘₯𝑦 = 2/7 6π‘₯𝑧/8.5π‘₯𝑧 = 12/17 0 food 3π‘₯𝑦/7π‘₯𝑦 = 3/7 π‘₯𝑧/8.5π‘₯𝑧 = 1/17 π‘₯/3.2π‘₯ = 5/16 Table 7: Representations. 13 π‘ž Β· π‘‘π‘œπ‘1 π‘ž Β· π‘‘π‘œπ‘2 dog 6/7 Β· 15/16 ∼ 0.8 12/17 Β· 15/16 ∼ 0.66 cat 0 0 food 3/7 Β· 5/16 ∼ 0.13 1/17 Β· 5/16 ∼ 0.02 Table 8: Relevance. Here 0.8 + 0.13 > 0.66 + 0.02 and therefore π‘‘π‘œπ‘1 is more relevant than π‘‘π‘œπ‘2. Exercise 4/5 Calculate the vector-space similarity between the query digital cameras and a document containing digital cameras and video cameras by filling in the blank columns in the table below. Assume 𝑁 = 10000000, logarithmic term weighting (columns 𝑀) for both query and documents, idf weighting only for the query and cosine normalization only for the document. and is a STOP word. Query Document relevance df tf w idf π‘ž tf w 𝑑 π‘ž Β· 𝑑 digital 10 000 video 100 000 cameras 50 000 Table 9: Exercise. The tf value is filled according to the occurrences of the terms in both query and document. tf π‘ž = π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = (1, 0, 1) tf 𝑑 = π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  π‘Žπ‘›π‘‘ π‘£π‘–π‘‘π‘’π‘œ π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = (1, 1, 2) Logarithmic weighting uses the Definition 6. For the query the values are 𝑀 π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ = 1 + log (1) = 1 + 0 = 1 𝑀 π‘£π‘–π‘‘π‘’π‘œ = 0 𝑀 π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = 1 + log (1) = 1 + 0 = 1 and for the document 𝑀 π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ = 1 + log (1) = 1 + 0 = 1 𝑀 π‘£π‘–π‘‘π‘’π‘œ = 1 + log (1) = 1 + 0 = 1 𝑀 π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = 1 + log (2) = 1 + 0.301 = 1.301 Now we need to count the idf weights for the query. These are counted by Definition 7. 𝑖𝑑𝑓 π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ = log (︁ 107 104 )︁ = log (οΈ€ 103 )οΈ€ = 3 𝑖𝑑𝑓 π‘£π‘–π‘‘π‘’π‘œ = log (︁ 107 105 )︁ = log (οΈ€ 102 )οΈ€ = 2 𝑖𝑑𝑓 π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = log (︁ 107 5Γ—104 )︁ = log (200) = 2.301 14 and π‘ž = 𝑀 Β· 𝑖𝑑𝑓. Cosine normalization for the document is counted similarly as in the last exercises by Definition 9 using 𝑀. 𝑑 π‘‘π‘–π‘”π‘–π‘‘π‘Žπ‘™ = 1√ 12+12+1.3012 = 0.5204 𝑑 π‘£π‘–π‘‘π‘’π‘œ = 1√ 12+12+1.3012 = 0.5204 𝑑 π‘π‘Žπ‘šπ‘’π‘Ÿπ‘Žπ‘  = 1.301√ 12+12+1.3012 = 0.677 The score is the scalar multiple of π‘ž and 𝑑. The final table is Query Document relevance df tf w idf q tf w d π‘ž Β· 𝑑 digital 10 000 1 1 3 3 1 1 0.5204 1.5612 video 100 000 0 0 2 0 1 1 0.5204 0 cameras 50 000 1 1 2.301 2.301 2 1.301 0.677 1.5578 Table 10: Solution. and the similarity score is π‘ π‘π‘œπ‘Ÿπ‘’(𝑑, π‘ž) = 3βˆ‘οΈ 𝑖=1 (𝑑𝑖 Β· π‘žπ‘–) = 3.119. Exercise 4/6 Show that for the query π‘ž1 = affection the documents in the table below are sorted by relevance in the opposite order as for the query π‘ž2 = jealous gossip. Query is tf weight normalized. SaS PaP WH affection 0.996 0.993 0.847 jealous 0.087 0.120 0.466 gossip 0.017 0 0.254 Table 11: Exercise. We add queries to the original table: SaS PaP WH π‘ž1 π‘ž2 affection 0.996 0.993 0.847 1 0 jealous 0.087 0.120 0.466 0 1 gossip 0.017 0 0.254 0 1 Table 12: Exercise with queries. Now we normalize the vectors π‘žπ‘– by Definition 9 and get 15 SaS PaP WH π‘ž1 π‘ž2 π‘ž1𝑛 π‘ž2𝑛 affection 0.996 0.993 0.847 1 0 1 0 jealous 0.087 0.120 0.466 0 1 0 0.7071 gossip 0.017 0 0.254 0 1 0 0.7071 Table 13: Exercise with queries after normalization. In the last step we count the similarity score between the queries and documents by π‘ π‘π‘œπ‘Ÿπ‘’(𝑑, π‘ž) = βˆ‘οΈ€|𝑑| 𝑖=1(𝑑𝑖 Β· π‘žπ‘–) π‘ π‘π‘œπ‘Ÿπ‘’(π‘†π‘Žπ‘†, π‘ž1) = 0.9961 Β· 1 + 0.087 Β· 0 + 0.017 Β· 0 = 0.9961 π‘ π‘π‘œπ‘Ÿπ‘’(𝑃 π‘Žπ‘ƒ, π‘ž1) = 0.993 Β· 1 + 0.120 Β· 0 + 0 Β· 0 = 0.993 π‘ π‘π‘œπ‘Ÿπ‘’(π‘Š 𝐻, π‘ž1) = 0.847 Β· 1 + 0.466 Β· 0 + 0.254 Β· 0 = 0.847 π‘ π‘π‘œπ‘Ÿπ‘’(π‘†π‘Žπ‘†, π‘ž2) = 0.9961 Β· 0 + 0.087 Β· 0.7071 + 0.017 Β· 0.7071 = 0.0735 π‘ π‘π‘œπ‘Ÿπ‘’(𝑃 π‘Žπ‘ƒ, π‘ž2) = 0.993 Β· 0 + 0.120 Β· 0.7071 + 0 Β· 0.7071 = 0.0849 π‘ π‘π‘œπ‘Ÿπ‘’(π‘Š 𝐻, π‘ž2) = 0.847 Β· 0 + 0.466 Β· 0.7071 + 0.254 Β· 0.7071 = 0.5091 The ordering for π‘ž1 is SaS > PaP > WH and for π‘ž2 is WH > PaP > SaS, and we see that they are opposite. Seminar 6 Definition 10 (Recall) Recall describes how many of the relevant documents are retrieved. π‘Ÿπ‘’π‘π‘Žπ‘™π‘™ = 𝑅 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ Definition 11 (Precision) Precision describes how many of the retrieved documents are relevant. π‘π‘Ÿπ‘’π‘π‘–π‘ π‘–π‘œπ‘› = 𝑃 = #π‘Ÿπ‘’π‘™π‘’π‘£π‘Žπ‘›π‘‘ π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ #π‘Ÿπ‘’π‘‘π‘Ÿπ‘–π‘’π‘£π‘’π‘‘ Definition 12 (𝐹 -measure) A balanced 𝐹-measure (𝐹1-measure) defines a recall-precision relationship represented by their weighted harmonic mean: 𝐹 = 2 Β· 𝑅 Β· 𝑃 𝑅 + 𝑃 Definition 13 (Mean Average Precision) MAP expresses the precision in each point a new relevant document is included in the result. Is counted as 𝑀 𝐴𝑃(𝑄) = 1 |𝑄| Β· βŽ› ⎝ βˆ‘οΈ π‘žβˆˆπ‘„ 1 π‘Ÿπ‘’π‘™ π‘ž Β· βŽ› ⎝ π‘Ÿπ‘’π‘™ π‘ž βˆ‘οΈ 𝑖=1 π‘π‘Ÿπ‘’π‘π‘– ⎞ ⎠ ⎞ ⎠ where π‘Ÿπ‘’π‘™ π‘ž is the number of relevant documents for query π‘ž and π‘π‘Ÿπ‘’π‘π‘– is the precision at the 𝑖-th document. 16 Definition 14 (πœ… statistic) Let 𝑁 be the total number of documents, 𝐽 is a set of judges and 𝑃(𝐴) = #π‘Žπ‘”π‘Ÿπ‘’π‘’ 𝑁 the number of documents on which the judges agree. Let also define 𝑅 𝑗 and 𝑁 𝑅 𝑗 be the number of relevant and non-relevant documents, respectively, according to the judge 𝑗 ∈ 𝐽 and 𝑃(𝑅) = βˆ‘οΈ€ π‘—βˆˆπ½ 𝑅 𝑗 |𝐽| Β· 𝑁 and 𝑃(𝑁 𝑅) = βˆ‘οΈ€ π‘—βˆˆπ½ 𝑁 𝑅 𝑗 |𝐽| Β· 𝑁 as the number of relevant and non-relevant documents, respectively. Let finally define 𝑃(𝐸) = 𝑃(𝑅)2 + 𝑃(𝑁 𝑅)2 as the approximate number of disagreements between the judges. Then the πœ… statistic is defined as the measure of agreement between the judges πœ… = 𝑃(𝐴) βˆ’ 𝑃(𝐸) 1 βˆ’ 𝑃(𝐸) . Definition 15 (Rocchio relevance feedback) Rocchio relevance feedback has the form π‘ž π‘š = π›Όπ‘ž0 + 𝛽 1 |𝐷 π‘Ÿ| βˆ‘οΈ ⃗𝑑 π‘Ÿβˆˆπ· π‘Ÿ ⃗𝑑 π‘Ÿ βˆ’ 𝛾 1 |𝐷 π‘›π‘Ÿ| βˆ‘οΈ ⃗𝑑 π‘›π‘Ÿβˆˆπ· π‘›π‘Ÿ ⃗𝑑 π‘›π‘Ÿ where π‘ž0 is the original query vector, 𝐷 π‘Ÿ is the set of relevant documents, 𝐷 π‘›π‘Ÿ is the set of non-relevant documents and the values 𝛼, 𝛽, 𝛾 depend on the system setting. Exercise 6/1 The following ordered list of 20 letters 𝑅 and 𝑁 represents relevant (𝑅) and non-relevant (𝑁) retrieved documents as an answer for a query on a collection of 10 000 documents. The leftmost document is expected to be the most relevant. The list contains 6 relevant documents. Assume that the collection contains 8 documents relevant to the query. 𝑅𝑅𝑁 𝑁 𝑁 𝑁 𝑁 𝑁 𝑅𝑁 𝑅𝑁 𝑁 𝑁 𝑅𝑁 𝑁 𝑁 𝑁 𝑅 a) What is the precision on the first 20 results? b) What is the 𝐹-measure on the first 20 results? c) What is the non-interpolated precision of the system at 25% recall? (R=25%) d) What is the interpolated precision of the system at 33% recall? (R>33%) e) Assume that these 20 documents is the complete list of retrieved documents. What is the MAP of the system? Now assume that the system returned all 10,000 documents in an ordered list and above is the top 20. f) What is the highest possible MAP the system can achieve? g) What is the lowest possible MAP the system can achieve? 17 Applying the Definition 11 do calculate (a) we get 𝑃 = 6 20 = 3 10 . For (b) it is necessary to count the recall with Definition 10 as 𝑅 = 6 8 = 3 4 . Using Definition 12 with 𝛼 = 0.5 we count (𝛽2 + 1)𝑃 𝑅 𝛽2 𝑃 + 𝑅 = (12 + 1) Β· 3 10 Β· 3 4 3 10 + 3 4 = 9 20 21 20 = 3 7 . To count the non-interpolated precision of the system at 25% recall for (c) we need the precisions for the documents of recall equal to 25%: 1. 𝑃 = 1 1 𝑅 = 1 8 = 12.5% 2. 𝑃 = 2 2 𝑅 = 2 8 = 25% 3. 𝑃 = 2 3 𝑅 = 2 8 = 25% Β· Β· Β· 8. 𝑃 = 2 8 𝑅 = 2 8 = 25% 9. 𝑃 = 3 9 𝑅 = 3 8 = 37.5% We see that the first document has 𝑅 = 12.5% but this value is less than 25%. Documents 2 to 8 have the desired recall of 25%. Document 9 has already a higher value so we do not include it in the result. Non-interpolated precision is then the set of precisions of these 7 documents 𝑃 = {οΈ‚ 2 2 , 2 3 , 2 4 , 2 5 , 2 6 , 2 7 , 2 8 }οΈ‚ . For the interpolated precision (d) we are looking for the highest precision for the relevant documents of recall higher than 33%. Recall changes if and only if the result contains a relevant document. Therefore, the values are calculated for the documents 11, 15 and 20. 11. 𝑃 = 4 11 𝑅 = 4 8 = 50% 15. 𝑃 = 5 15 𝑅 = 5 8 = 62.5% 20. 𝑃 = 6 20 𝑅 = 6 8 = 75% The requested recall value is exceeded by retrieving the document 9 (37.5 %). Now we only have to find π‘šπ‘Žπ‘₯{𝑃9, 𝑃11, 𝑃15, 𝑃20} = π‘šπ‘Žπ‘₯{3 9 , 4 11 , 5 15 , 6 20 } = 4 11 = 0.36. To estimate the MAP of the system (e) we use the Definition 13. Since we only have one query 𝑁 = |𝑄| = 1 and the first 20 documents contain π‘Ÿπ‘’π‘™ π‘ž = 6 relevant documents 𝑀 𝐴𝑃(𝑄) = 1 1 βŽ› ⎝ 1βˆ‘οΈ 𝑗=1 1 8 (οΈƒ 6βˆ‘οΈ 𝑖=1 𝑃(π‘‘π‘œπ‘π‘–) )οΈƒβŽž ⎠ = 1 1 Β· 1 8 βŽ› ⎜ ⎜ ⎝ 1 1⏟ ⏞ 𝑃 (1.) + 2 2⏟ ⏞ 𝑃 (2.) + 3 9⏟ ⏞ 𝑃 (9.) + 4 11⏟ ⏞ 𝑃 (11.) + 5 15⏟ ⏞ 𝑃 (15.) + 6 20⏟ ⏞ 𝑃 (20.) ⎞ ⎟ ⎟ ⎠ = 1 1 Β· 1 8 Β· 1099 330 = 1099 2640 = 0.416287 18 From Definition 11 and Definition13 follows that if the two remaining relevant documents were on positions 21 and 22 then MAP with π‘Ÿπ‘’π‘™ π‘ž = 8 relevant documents is the highest possible (f) 𝑀 𝐴𝑃(𝑄) = 1 8 (οΈ‚ 1 1 + 2 2 + 3 9 + 4 11 + 5 15 + 6 20 + 7 21 + 8 22 )οΈ‚ = 0.503409 and, on the other hand, if they were on the last places the MAP would be minimal (g) 𝑀 𝐴𝑃(𝑄) = 1 8 (οΈ‚ 1 1 + 2 2 + 3 9 + 4 11 + 5 15 + 6 20 + 7 9999 + 8 10000 )οΈ‚ = 0.41647538. Exercise 6/2 The following ordered list of 5 letters 𝑅 and 𝑁 represent relevant (𝑅) and non-relevant (𝑁) retrieved documents as an answer for a query on a collection of 100 documents. The leftmost document is expected to be the most relevant. The list contains 3 relevant documents. Assume that the collection contains 5 documents relevant to the query. 𝑅𝑁 𝑁 𝑅𝑅 βˆ™ What is the F-measure on the first 5 results? Assume that these 5 documents is the complete list of retrieved documents. βˆ™ What is the MAP of the system? Now assume that the system returned all 100 documents in an ordered list and above is the top 5. βˆ™ What are the highest and lowest possible MAPs the system can achieve? βˆ™ 𝑅 = 3 5 , 𝑃 = 3 5 , then 𝐹 = 3 5 . βˆ™ 1 5 (οΈ€1 1 + 2 4 + 3 5 )οΈ€ . βˆ™ highest 1 5 (οΈ€1 1 + 2 4 + 3 5 + 4 6 + 5 7 )οΈ€ and lowest 1 5 (οΈ€1 1 + 2 4 + 3 5 + 4 99 + 5 100 )οΈ€ . Exercise 6/3 The following two sequences of letters 𝑅 and 𝑁 represent the complete lists of relevant (𝑅) and non-relevant (𝑁) retrieved documents as answers for two queries on a collection of 100 documents. The leftmost document is expected to be the most relevant. Assume that the collection contains 10 documents relevant to the first query and 20 documents relevant to the second query. Find the F-measure and the MAP of this system. 𝑁 𝑅𝑁 𝑁 𝑁 𝑅𝑁 and 𝑁 𝑁 𝑅𝑅𝑅 19 βˆ™ 𝑅1 = 2 10 , 𝑃1 = 2 7 , then 𝐹1 = 4 35 17 35 = 4 17 . βˆ™ 𝑅2 = 3 20 , 𝑃2 = 3 5 , then 𝐹2 = 9 50 3 4 = 6 25 . βˆ™ 𝐹 = 𝐹1+𝐹2 2 = 101 425 . βˆ™ MAP1 = 1 10 (οΈ€1 2 + 2 6 )οΈ€ = 5 2*10*6 = 1 12 . βˆ™ MAP2 = 1 20 (οΈ€1 3 + 2 4 + 3 5 )οΈ€ = 20+30+12 2*20*60 = 31 600 . βˆ™ MAP = MAP1+MAP2 2 = 81 1200 . Exercise 6/4 Below is a table showing how two judges judged the relevance (0 = non-relevant, 1 = relevant) of the set of 12 documents with respect to a query. Assume that you developed an IR system, that for this query returns the documents {4, 5, 6, 7, 8}. Doc ID Judge 1 Judge 2 1 0 0 2 0 0 3 1 1 4 1 1 5 1 0 6 1 0 7 1 0 8 1 0 9 0 1 10 0 1 11 0 1 12 0 1 Table 14: Judges judging the relevance of documents. a) Calculate the πœ… statistic. b) Calculate the recall, precision and 𝐹-measure of your system in which a document is considered relevant if the judges agree. c) Calculate the recall, precision and 𝐹-measure of your system in which a document is considered relevant if at least one of the judges thinks so. For (a) it is necessary to use the Definition 14. First we count 𝑃(𝐴) as the number of documents on which the judges agree. Since these are the documents {1, 2, 3, 4} and the total number of them is 𝑁 = 12, then 𝑃(𝐴) = |{1,2,3,4}| 12 = 4 12 = 1 3 . Now we need the counts of disagreements between the judges. Judge 1 considers the documents 𝑁 𝑅1 = 20 {1, 2, 9, 10, 11, 12} and judge 2 𝑁 𝑅2 = {1, 2, 5, 6, 7, 8} as non-relevant. Plugging in to the formula for 𝑃(𝑁 𝑅) we get 𝑃(𝑁 𝑅) = |𝑁 𝑅1|+|𝑁 𝑅2| 2Β·12 = 2Β·6 24 = 1 2 . We repeat this for 𝑃(𝑅). Since the number of relevant is equal to non-relevant, then 𝑃(𝑅) = 𝑃(𝑁 𝑅) = 1 2 . We finally can count 𝑃(𝐸) as 𝑃(𝐸) = 𝑃(𝑁 𝑅)2 + 𝑃(𝑅)2 = (οΈ‚ 1 2 )οΈ‚2 + (οΈ‚ 1 2 )οΈ‚2 = 1 4 + 1 4 = 1 2 and πœ… = 𝑃(𝐴) βˆ’ 𝑃(𝐸) 1 βˆ’ 𝑃(𝐸) = 1 3 βˆ’ 1 2 1 βˆ’ 1 2 = βˆ’ 1 3 . If πœ… < 0 then the agreement between the judges is more than random. For (b) it is necessary to calculate recall and precision. Our system retrieves the documents {4, 5, 6, 7, 8} as relevant whereas the judges only agree on {3, 4}. The intersection is {4}. As to the Definition 11 we obtain 𝑃 = |{4}| |{4, 5, 6, 7, 8}| = 1 5 and, as the number of relevant documents is |{3, 4}| = 2, the recall is 𝑅 = 1 2 by Definition 10. Then, by Definition 12 the 𝐹-measure is equal to 𝐹 = (𝛽2 + 1)𝑃 𝑅 𝛽2 𝑃 + 𝑅 = (1 + 1)1 5 1 2 1 5 + 1 2 = 2 7 . We similarly count these for (c) which says that a document is relevant if at least one judge considers it relevant. This makes the documents {3, 4, 5, 6, 7, 8, 9, 10, 11, 12} relevant. Their intersection with our result {4, 5, 6, 7, 8} is the set {4, 5, 6, 7, 8} of size 5. Recall is 𝑅 = |{4,5,6,7,8}| |{3,4,5,6,7,8,9,10,11,12}| = 5 10 = 1 2 , precision is 𝑃 = 5 5 = 1 and 𝐹-measure is 𝐹 = (1 + 1)1 2 1 + 1 2 = 2 3 . Exercise 6/5 What is the main purpose of Rocchio relevance feedback? Answers can vary. For official definition refer to the Manning book. Exercise 6/6 A user’s primary query is cheap CDs cheap DVDs extremely cheap CDs. The user has a look on two documents: doc1 a doc2, marking doc1 CDs cheap software cheap CDs as relevant and doc2 cheap thrills DVDs as non-relevant. Assume that we use a simple tf scheme without vector length normalization. What would be the restructured query vector after considering the Rocchio relevance feedback with values 𝛼 = 1, 𝛽 = 0.75 and 𝛾 = 0.25? 21 We rewrite the exercise to the table for an easier processing. relevant non-relevant terms doc1 doc2 query Cds 2 0 2 cheap 2 1 3 software 1 0 0 thrills 0 1 0 DVDs 0 1 1 extremely 0 0 1 Table 15: Now we mark the input if the algorithm by Definition 15. 𝑑 π‘Ÿ ∈ 𝐷 π‘Ÿ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 2 1 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ , 𝑑 π‘›π‘Ÿ ∈ 𝐷 π‘›π‘Ÿ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 1 0 1 1 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ , π‘ž = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ By filling the values to the formula for π‘ž π‘š we get π‘ž π‘š = 1 Β· βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ + 0.75 Β· 1 1 βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 2 1 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ βˆ’ 0.25 Β· 1 1 βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 1 0 1 1 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 2 3 0 0 1 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ + βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 1.5 1.5 0.75 0 0 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ βˆ’ βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 0 0.25 0 0.25 0.25 0 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ = βŽ› ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 3.5 4.25 0.75 βˆ’0.25 0.75 1 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ Seminar 7 Definition 16 (Naive Bayes Classifier) Naive Bayes (NB) Classifier assumes that the effect of the value of a predictor π‘₯ on a given class 𝑐 is class conditional independent. Bayes theorem provides a way of calculating the 22 posterior probability 𝑃(𝑐|π‘₯) from class prior probability 𝑃(𝑐), predictor prior probability 𝑃(π‘₯) and probability of the predictor given the class 𝑃(π‘₯|𝑐) 𝑃(𝑐|π‘₯) = 𝑃(π‘₯|𝑐)𝑃(𝑐) 𝑃(π‘₯) and for a vector of predictors 𝑋 = (π‘₯1, . . . , π‘₯ 𝑛) 𝑃(𝑐|𝑋) = 𝑃(π‘₯1|𝑐) . . . 𝑃(π‘₯ 𝑛|𝑐)𝑃(𝑐). The class with the highest posterior probability is the outcome of prediction. Exercise 7/1 What is naive about Naive Bayes classifier? Briefly outline its major idea. Answers can vary. For official definition refer to the Manning book. Exercise 7/2 Considering the table of observations, use the Naive Bayes classifier to recommend whether to Play Golf given a day with Outlook = Rainy, Temperature = Mild, Humidity = Normal and Windy = True. Do not deal with the zero-frequency problem. Outlook Temperature Humidity Windy Play Golf Rainy Hot High False No Rainy Hot High True No Overcast Hot High False Yes Sunny Mild High False Yes Sunny Cool Normal False Yes Sunny Cool Normal True No Overcast Cool Normal True Yes Rainy Mild High False No Rainy Cool Normal False Yes Sunny Mild Normal False Yes Rainy Mild Normal True Yes Overcast Mild High True Yes Overcast Hot Normal False Yes Sunny Mild High True No Table 16: Exercise. First build the likelihood tables for each predictor 23 Play Golf Yes No Outlook Sunny 3/9 2/5 5/14 Overcast 4/9 0/5 4/14 Rainy 2/9 3/5 5/14 9/14 5/14 Play Golf Yes No Temperature Hot 2/9 2/5 4/14 Mild 4/9 2/5 6/14 Cool 3/9 1/5 4/14 9/14 5/14 Play Golf Yes No Humidity High 3/9 4/5 7/14 Normal 6/9 1/5 7/14 9/14 5/14 Play Golf Yes No Windy True 3/9 2/5 5/14 False 6/9 3/5 9/14 9/14 5/14 We see that probability of Sunny given Yes is 3/9 = 0.33, probability of Sunny is 5/14 = 0.36 and probability of Yes is 9/14 = 0.64. Then we count the likelihoods of Yes and No 𝑃(π‘Œ 𝑒𝑠|π‘…π‘Žπ‘–π‘›π‘¦, 𝑀 𝑖𝑙𝑑, 𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™, 𝑇 π‘Ÿπ‘’π‘’) = = 𝑃(π‘…π‘Žπ‘–π‘›π‘¦|π‘Œ 𝑒𝑠) Β· 𝑃(𝑀 𝑖𝑙𝑑|π‘Œ 𝑒𝑠) Β· 𝑃(𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™|π‘Œ 𝑒𝑠) Β· 𝑃(𝑇 π‘Ÿπ‘’π‘’|π‘Œ 𝑒𝑠) Β· 𝑃(π‘Œ 𝑒𝑠) = 2 9 Β· 4 9 Β· 6 9 Β· 3 9 Β· 9 14 = 0.014109347 𝑃(𝑁 π‘œ|π‘…π‘Žπ‘–π‘›π‘¦, 𝑀 𝑖𝑙𝑑, 𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™, 𝑇 π‘Ÿπ‘’π‘’) = = 𝑃(π‘…π‘Žπ‘–π‘›π‘¦|𝑁 π‘œ) Β· 𝑃(𝑀 𝑖𝑙𝑑|𝑁 π‘œ) Β· 𝑃(𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™|𝑁 π‘œ) Β· (𝑇 π‘Ÿπ‘’π‘’|𝑁 π‘œ) Β· 𝑃(𝑁 π‘œ) = 3 5 Β· 2 5 Β· 1 5 Β· 3 5 Β· 5 14 = 0.010285714 (1) and suggest Yes. We can normalize the likelihoods to obtain the % confidence: 𝑃(π‘Œ 𝑒𝑠|π‘…π‘Žπ‘–π‘›π‘¦, 𝑀 𝑖𝑙𝑑, 𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™, 𝑇 π‘Ÿπ‘’π‘’) = 0.014109347 0.014109347 + 0.010285714 = 57.84% 𝑃(𝑁 π‘œ|π‘…π‘Žπ‘–π‘›π‘¦, 𝑀 𝑖𝑙𝑑, 𝑁 π‘œπ‘Ÿπ‘šπ‘Žπ‘™, 𝑇 π‘Ÿπ‘’π‘’) = 0.010285714 0.014109347 + 0.010285714 = 42.16% Definition 17 (Support Vector Machines Classifier (two-class, linearly separable)) Support Vector Machines (SVM) finds the hyperplane that bisects and is perpendicular to the connecting line of the closest points from the two classes. The separating (decision) hyperplane is defined in terms of a normal (weight) vector w and a scalar intercept term 𝑏 as 𝑓(π‘₯) = w Β· x + 𝑏 where Β· is the dot product of vectors. Finally, the SVM classifier becomes π‘π‘™π‘Žπ‘ π‘ (π‘₯) = 𝑠𝑔𝑛(𝑓(π‘₯)). Exercise 7/3 Draw a sketch explaining the concept of SVM classifier. Include the equation of the separation hyperplane. What are limitations of SVM? Answers can vary. For official definition refer to the Manning book. 24 Exercise 7/4 Build the SVM classifier for the training set {([1, 1], βˆ’1), ([2, 0], βˆ’1), ([2, 3], +1)}. We first take the closest two points from the respective classes: [1, 1] and [2, 3]. We have w = π‘Ž Β· ([1, 1] βˆ’ [2, 3]) = [π‘Ž, 2π‘Ž]. Now we calculate π‘Ž and 𝑏 π‘Ž + 2π‘Ž + 𝑏 = βˆ’1 2π‘Ž + 6π‘Ž + 𝑏 = 1 for the points [1, 1] and [2, 3], respectively. The solution is π‘Ž = 2 5 𝑏 = βˆ’11 5 building the weight vector w = [οΈ‚ 2 5 , 4 5 ]οΈ‚ and the final classifier becomes π‘π‘™π‘Žπ‘ π‘ (π‘₯) = 𝑠𝑔𝑛 (οΈ‚ 2 5 π‘₯1 + 4 5 π‘₯2 βˆ’ 11 5 )οΈ‚ . Exercise 7/5 Explain the concept of classification based on neural networks. Draw a sketch and comment on all components. Answers can vary. For official definition refer to the Manning book. Exercise 7/6 What is the difference between supervised and unsupervised learning? Give examples. Answers can vary. For official definition refer to the Manning book. 25 Seminar 9 Algorithm 1 K-means({βƒ—π‘₯1, . . . , βƒ—π‘₯ 𝑁 }, 𝐾, stopping criterion) 1: (⃗𝑠1, . . . ,⃗𝑠 𝐾) ← SelectRandomSeeds({βƒ—π‘₯1, . . . , βƒ—π‘₯ 𝑁 }, 𝐾) 2: for π‘˜ ← 1 to 𝐾 do 3: βƒ—πœ‡ π‘˜ ← ⃗𝑠 π‘˜ 4: end for 5: while stopping criterion has not been met do 6: for π‘˜ ← 1 to 𝐾 do 7: πœ” π‘˜ ← {} 8: end for 9: for 𝑛 ← 1 to 𝑁 do 10: 𝑗 ← argmin 𝑗′ |βƒ—πœ‡ 𝑗′ βˆ’ βƒ—π‘₯ 𝑛| 11: πœ” 𝑗 ← πœ” 𝑗 βˆͺ {βƒ—π‘₯ 𝑛} ◁ reassigning vectors 12: end for 13: for π‘˜ ← 1 to 𝐾 do 14: βƒ—πœ‡ π‘˜ ← 1 |πœ” π‘˜| βˆ‘οΈ€ βƒ—π‘₯βˆˆπœ” π‘˜ βƒ—π‘₯ ◁ recomputing centroids 15: end for 16: end while 17: return {βƒ—πœ‡1, . . . , βƒ—πœ‡ 𝐾} Exercise 9/1 Use the 𝐾-means algorithm with Euclidean distance to cluster the following 𝑁 = 8 examples into 𝐾 = 3 clusters: 𝐴1 = (2, 10), 𝐴2 = (2, 5), 𝐴3 = (8, 4), 𝐴4 = (5, 8), 𝐴5 = (7, 5), 𝐴6 = (6, 4), 𝐴7 = (1, 2), 𝐴8 = (4, 9). Suppose that the initial seeds (centers of each cluster) are 𝐴1, 𝐴4 and 𝐴7. Run the 𝐾-means algorithm for 3 epochs. After each epoch, draw a 10 Γ— 10 space with all the 8 points and show the clusters with the new centroids. 𝑑(𝐴, 𝐡) denotes the Euclidean distance between 𝐴 = (π‘Ž1, π‘Ž2) and 𝐡 = (𝑏1, 𝑏2). It is calculated as 𝑑(𝐴, 𝐡) = βˆšοΈ€ (π‘Ž1 βˆ’ 𝑏1)2 + (π‘Ž2 βˆ’ 𝑏2)2. Take seeds ⃗𝑠1 = 𝐴1 = (2, 10), ⃗𝑠2 = 𝐴4 = (5, 8), ⃗𝑠3 = 𝐴7 = (1, 2). By 1 we count the alignment for epoch 1: 𝐴1 ∈ πœ”1, 𝐴2 ∈ πœ”3, 𝐴3 ∈ πœ”2, 𝐴4 ∈ πœ”2, 𝐴5 ∈ πœ”2, 𝐴6 ∈ πœ”2, 𝐴7 ∈ πœ”3, 𝐴8 ∈ πœ”2; and we get the clusters: πœ”1 = {𝐴1}, πœ”2 = {𝐴3, 𝐴4, 𝐴5, 𝐴6, 𝐴8}, πœ”3 = {𝐴2, 𝐴7}. Centroids of the clusters: βƒ—πœ‡1 = (2, 10), βƒ—πœ‡2 = ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6), βƒ—πœ‡3 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5). After epoch 2 the clusters are πœ”1 = {𝐴1, 𝐴8}, πœ”2 = {𝐴3, 𝐴4, 𝐴5, 𝐴6}, πœ”3 = {𝐴2, 𝐴7} with centroids βƒ—πœ‡1 = (3, 9.5), βƒ—πœ‡2 = (6.5, 5.25) and βƒ—πœ‡3 = (1.5, 3.5). And finally after epoch 3, the clusters are πœ”1 = {𝐴1, 𝐴4, 𝐴8}, πœ”2 = {𝐴3, 𝐴5, 𝐴6}, πœ”3 = {𝐴2, 𝐴7} with centroids βƒ—πœ‡1 = (3.66, 9), βƒ—πœ‡2 = (7, 4.33) and βƒ—πœ‡3 = (1.5, 3.5). 26 A7 ∈ cluster3 A8 ∈ cluster2 end of epoch1 new clusters: 1: {A1}, 2: {A3, A4, A5, A6, A8}, 3: {A2, A7} b) centers of the new clusters: C1= (2, 10), C2= ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6), C3= ((2+1)/2, (5+2)/2) = (1.5, 3.5) c) 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A 1 A2 A3 A 4 A5 A6 A7 A8 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A1 A 2 A3 A4 A5 A6 A7 A8 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A1 A 2 A 3 A4 A5 A6 A7 A8 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A1 A2 A3 A4 A5 A6 A7 A8 x x x d) We would need two more epochs. After the 2nd epoch the results would be: 1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7} with centers C1=(3, 9.5), C2=(6.5, 5.25) and C3=(1.5, 3.5). After the 3rd epoch, the results would be: 1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7} with centers C1=(3.66, 9), C2=(7, 4.33) and C3=(1.5, 3.5). Exercise 2. Nearest Neighbor clustering Use the Nearest Neighbor clustering algorithm and Euclidean distance to cluster the examples from the previous exercise: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Suppose that the threshold t is 4. Solution: A1 is placed in a cluster by itself, so we have K1={A1}. We then look at A2 if it should be added to K1 or be placed in a new cluster. 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A1 A2 A3 A4 A5 A6 A7 A8 x x x 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 A1 A2 A3 A4 A5 A6 A7 A8x x x Figure 1: Visualization of 𝐾-means clustering algorithm. 27 Exercise 9/2 Consider three points: 𝐴1 = [1, 1], 𝐴2 = [3, 1], 𝐴3 = [6, 1]. Give an example of a point 𝐴4 such that the K-means clustering algorithm with seeds {𝐴2, 𝐴4} and the agglomerative hierarchical clustering algorithm result in different clusterings of {𝐴1, 𝐴2, 𝐴3, 𝐴4} into 2 classes. For example, if 𝐴4 = [2, 1], then K-means results in {{𝐴1, 𝐴4}, {𝐴2, 𝐴3}} and agglomerative in {{𝐴1, 𝐴2, 𝐴4}, {𝐴3}}. Exercise 9/3 What makes a good clustering? Give some clustering evaluation metrics. Answers can vary. For official definition refer to the Manning book. Seminar 11 Definition 18 (Markov Transition Matrix) Given the graph 𝐺 = (𝑉, 𝐸) and teleport probability 𝛼, let 𝑁 = |𝑉 | and 𝐴 be the 𝑁 Γ— 𝑁 link matrix with elements βˆ€π‘’, 𝑣 ∈ 𝑉 : 𝐴 𝑒𝑣 = {οΈƒ 1 (𝑒, 𝑣) ∈ 𝐸 0 otherwise The transition probability matrix 𝑃 is then calculated in the following way: 1. If a row of 𝐴 has all 0s, then substitute all of them with 1s. 2. Divide each 1 by the number of 1s in that row. 3. Multiply each entry by 1 βˆ’ 𝛼. 4. Add 𝛼 𝑁 to each entry. Algorithm 4 (PageRank) 1: function PageRank(𝑃) 2: 𝑖 ← 0 3: βˆ’β†’π‘₯ 𝑖 = (1, 0, . . . , 0) 4: βˆ’β†’π‘₯ 𝑖+1 = (0, 0, . . . , 0) 5: repeat 6: βˆ’β†’π‘₯ 𝑖+1 = βˆ’β†’π‘₯ 𝑖 Β· 𝑃 7: 𝑖 = 𝑖 + 1 8: until π‘₯𝑖 = π‘₯π‘–βˆ’1 9: end function Definition 19 (Hubs and authorities) Given the link matrix 𝐴, let β„Ž(𝑣) denote the hub score and π‘Ž(𝑣) the authority score. First, set the β„Ž(𝑣) a π‘Ž(𝑣) vectors to 1 𝑁 for all vertices 𝑣 ∈ 𝑉 . The scores are calculated as β„Ž(𝑣) = 𝐴 Β· π‘Ž(𝑣) π‘Ž(𝑣) = 𝐴 𝑇 Β· β„Ž(𝑣) 28 which is equivalent to β„Ž(𝑣) = 𝐴 Β· 𝐴 𝑇 Β· β„Ž(𝑣) π‘Ž(𝑣) = 𝐴 𝑇 Β· 𝐴 Β· π‘Ž(𝑣) Exercise 11/1 Assume the web graph 𝐺 = (𝑉 = {π‘Ž, 𝑏, 𝑐}, 𝐸 = {(π‘Ž, 𝑏), (π‘Ž, 𝑐), (𝑏, 𝑐), (𝑐, 𝑏)}). Count PageRank, hub and authority scores for each of the web pages. Rank the pages by the individual scores and observe the connections. You can assume that in each step of the random walk we teleport to a random page with probability 0.1 and uniform distribution. Normalize the hub and authority scores so that the maximum is 1. By Definition 18, rewrite the graph as a link matrix βŽ› ⎝ 0 1 1 0 0 1 0 1 0 ⎞ ⎠ . Trying to apply the step 1 the algorithm, does not contain a row with all 0s, so we proceed to step 2. First row contains two 1s so we divide both of them by 2. Second and third lines only contain one 1, so dividing them by 1 makes no change βŽ› ⎝ 0 1 1 0 0 1 0 1 0 ⎞ ⎠ : 2 : 1 : 1 = βŽ› ⎝ 0 1 2 1 2 0 0 1 0 1 0 ⎞ ⎠ . Now apply step 3. Since 𝛼 = 0.1, multiply all entries by 0.9 βŽ› ⎝ 0 1 2 1 2 0 0 1 0 1 0 ⎞ ⎠ Β· 0.9 = βŽ› ⎝ 0 9 20 9 20 0 0 9 10 0 9 10 0 ⎞ ⎠ and by step 4 add 𝛼 𝑁 = 0.1 3 = 1 30 βŽ› ⎝ 0 9 20 9 20 0 0 9 10 0 9 10 0 ⎞ ⎠ + 1 30 = βŽ› ⎝ 1 30 29 60 29 60 1 30 1 30 14 15 1 30 14 15 1 30 ⎞ ⎠ = 𝑃. Using the Algorithm 4 for PageRank, we select βˆ’β†’π‘₯ 0 = (1, 0, 0) and the transition probability matrix 𝑃 from the previous calculation. 29 βˆ’β†’π‘₯ 1 = βˆ’β†’π‘₯ 0 Β· 𝑃 (2) βˆ’β†’π‘₯ 1 = (1, 0, 0) Β· βŽ› ⎝ 1 30 29 60 29 60 1 30 1 30 14 15 1 30 14 15 1 30 ⎞ ⎠ (3) βˆ’β†’π‘₯ 1 = (οΈ‚ 1 30 , 29 60 , 29 60 )οΈ‚ (4) βˆ’β†’π‘₯ 2 = βˆ’β†’π‘₯ 1 Β· 𝑃 (5) βˆ’β†’π‘₯ 2 = (οΈ‚ 1 30 , 29 60 , 29 60 )οΈ‚ Β· βŽ› ⎝ 1 30 29 60 29 60 1 30 1 30 14 15 1 30 14 15 1 30 ⎞ ⎠ (6) βˆ’β†’π‘₯ 2 = (οΈ‚ 1 30 , 29 60 , 29 60 )οΈ‚ (7) Since π‘₯𝑖 = π‘₯π‘–βˆ’1, we claim the entries of π‘₯2 as PageRanks of the individual web pages. By Definition 19, we set the hub score β„Ž(𝑣) and the authority score π‘Ž(𝑣) to 1s π‘Ž(𝑣) = βŽ› ⎝ 1 1 1 ⎞ ⎠ , β„Ž(𝑣) = βŽ› ⎝ 1 1 1 ⎞ ⎠ . Substituting into the second equation in the definition, we get the hub score β„Ž(𝑣) = 𝐴 Β· 𝐴 𝑇 Β· β„Ž(𝑣) β„Ž(𝑣) = βŽ› ⎝ 0 1 1 0 0 1 0 1 0 ⎞ ⎠ Β· βŽ› ⎝ 0 0 0 1 0 1 1 1 0 ⎞ ⎠ Β· βŽ› ⎝ 1 1 1 ⎞ ⎠ β„Ž(𝑣) = βŽ› ⎝ 2 1 1 1 1 0 1 0 1 ⎞ ⎠ Β· βŽ› ⎝ 1 1 1 ⎞ ⎠ β„Ž(𝑣) = βŽ› ⎝ 4 2 2 ⎞ ⎠ and the authority score π‘Ž(𝑣) = 𝐴 𝑇 Β· 𝐴 Β· π‘Ž(𝑣) π‘Ž(𝑣) = βŽ› ⎝ 0 0 0 1 0 1 1 1 0 ⎞ ⎠ Β· βŽ› ⎝ 0 1 1 0 0 1 0 1 0 ⎞ ⎠ Β· βŽ› ⎝ 1 1 1 ⎞ ⎠ π‘Ž(𝑣) = βŽ› ⎝ 0 0 0 0 2 1 0 1 2 ⎞ ⎠ Β· βŽ› ⎝ 1 1 1 ⎞ ⎠ π‘Ž(𝑣) = βŽ› ⎝ 0 3 3 ⎞ ⎠ . 30 Finally, we normalize them to obtain β„Ž(𝑣) = βŽ› ⎝ 4 2 2 ⎞ ⎠ βŽ› ⎝ 1 0.5 0.5 ⎞ ⎠ and π‘Ž(𝑣) = βŽ› ⎝ 0 3 3 ⎞ ⎠ βŽ› ⎝ 0 1 1 ⎞ ⎠ . Seminar 12 Exercise 12/1 Pick a topic of your interest and describe it by 5-10 words. Open Sketch Engine https://ske.fi.muni.cz/. Go to WebBootCaT. Create a corpus using the description words as seed. Wait until data are downloaded. Search the word corpus for collocations. Solutions can vary. Definition 20 (Index Relations) Suppose that we could pick a random page from the index of 𝐸1 and test whether it is in 𝐸2’s index and symmetrically, test whether a random page from 𝐸2 is in 𝐸1. These experiments give us fractions π‘₯ and 𝑦 such that our estimate is that a fraction π‘₯ of the pages in 𝐸1 are in 𝐸2, while a fraction 𝑦 of the pages in 𝐸2 are in 𝐸1. Then, letting |𝐸𝑖| denote the size of the index of search engine 𝐸𝑖, we have π‘₯|𝐸1| β‰ˆ 𝑦|𝐸2|, from which we have the form we will use |𝐸1| |𝐸2| β‰ˆ 𝑦 π‘₯ . Exercise 12/2 Two web search engines 𝐴 and 𝐡 each generate a large number of pages uniformly at random from their indexes. 30% of 𝐴’s pages are present in 𝐡’s index, while 50% of 𝐡’s pages are present in 𝐴’s index. What is the number of pages in 𝐴’s index relative to 𝐡’s? Substituting to the Definition 20 we get the fractions βˆ™ 3 10 of 𝐴 is in 𝐡 βˆ™ 5 10 of 𝐡 is in 𝐴 and we get the equation 0.3|𝐴| β‰ˆ 0.5|𝐡| |𝐴| |𝐡| β‰ˆ 0.5 0.3 |𝐴| |𝐡| β‰ˆ 5 3 31 Definition 21 (Path Similarity) Similarity between a query XPath 𝑐 π‘ž and a document path 𝑐 𝑑 is calculated as 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = {οΈƒ 1+|𝑐 π‘ž| 1+|𝑐 𝑑| if 𝑐 π‘ž can be expanded to 𝑐 𝑑 by adding nodes to the path 0 otherwise Definition 22 (Structural Term) Structural term is defined as an XML-context/term pair denoted by of existing path to a value and the value itself, where the value itself is also a node in the XML document. For example, an XML document containing only a root element with test. test contains two structural terms and . Exercise 12/3 Consider the following the XML document: Computer Science Jennifer Widom Programming Methodology Introduction to the engineering of computer applications emphasizing modern software engineering principles. Jerry R. Cain Eric Roberts Mehran Sahami 32 Programming Abstractions Abstraction and its relation to programming. Eric Roberts Jerry R. Cain CS106A 1. Write the following expressions: a) Return all titles (including both departments and courses). b) Return all course titles that contain the word programming. c) Return the surnames of all instructors teaching at least one course that contains the word software in its description. d) Return the surnames of all professors teaching at least one course that contains the word software in its description. 2. Calculate the similarity between the queries and the corresponding document paths. a) //Instructors//Last_Name#Cain b) //Course/Instructors/Lecturer/Last_Name#Cain 1. a) //Title b) //Course//Title[contains(current(),’programming’)] c) //Course[contains(Description,’software’)]/Instructors//Last_Name d) //Course[contains(Description,’software’)]/Instructors/Professor/Last_Name 33 2. By Definition 21, a query 𝑐 π‘ž corresponds to document 𝑐 𝑑 if and only if it can be expanded. Original query for a) can be expanded to Course_Catalog/Department/Course/Instructors/Lecturer/Last_Name. Since 𝑐 π‘ž is expandable to 𝑐 𝑑, use the equation from the definition. Substituting for the query length 𝑐 π‘ž = 2 and the document length 𝑐 𝑑 = 6 to the formula we get 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = 1 + |𝑐 π‘ž| 1 + |𝑐 𝑑| = 1 + 2 1 + 6 = 3 7 . For b) we only change the query length 𝑐 π‘ž = 4 and obtain 𝐢𝑅(𝑐 π‘ž, 𝑐 𝑑) = 1 + |𝑐 π‘ž| 1 + |𝑐 𝑑| = 1 + 4 1 + 6 = 5 7 . Exercise 12/4 Count how many structural terms are present in the XML tree: Programming Abstractions Abstraction and its relation to programming Eric Roberts To save space, mark each element with its first letter only (D stands for Description, P for Professor, . . . ). By Definition 22, we count all combinations and write them into the table: C T Programming Abstractions C I P F Eric C I P L Roberts T Programming Abstractions I P F Eric I P L Roberts Programming Abstractions P F Eric P L Roberts C D Abstraction . . . F Eric L Roberts D Abstraction . . . Eric Roberts Abstraction . . . There are 16 structural terms in total. 34