sheet c j c I uco c I J—I L c L J—u l>J> l- j points Describe the SovindEx phonetic retrieval algorithm [2 points]. Give an example of two similarly-sounding words with the same SoundEx code [1 point]. Explain the weak point of SoundEx [1 point], and give an example of two different-sounding words with the same SoundEx codes [1 point]. eloiltwttw i/o&ttj #fii0. 0 ° 0 n ti A i 0 0 i You may also write on the other side of the sheet, but it won't be scanned. sheet e j I— uio l j l L S—K L—I c I J—j l—X points Consider the following XML document: iu> ?Egg) sausage, and spam ^ Ko+e k^Gi Ko+e brfoi That's not got much spam in it | I | | I | 9.9 W-'''tW\...M Lobster Thermidor With a fried egg on top, and spam 49.9 Draw the XML document as a graph [1 point], and count all structural terms in the document [2 points]. Compute the structural similarity between the structural term item/title#"That's not got much spam in it", and the queries //menu//price#30.9, and //item/title#"Lobster Thermidor" [2 points]. I *Um lhou.$ "fWs ..." hehiA/{W/hote#'*lWs ..." You may also write on the other side of the sheet, but it won't be scanned. sheet ■J UCO l j l L lMj l—4, l I J-_j 1_—X ,~\ r, ,n r. points Explain the following aspects of the K-means flat clustering algorithm [2 points]: 1. What do we need to know about our dataset before using the algorithm? 2. What is the input and the output of the algorithm? 3. What are the two steps that take place in every epoch? 4. How do we decide in which epoch to stop the algorithm? /f. mcA to ^K0(*/ w*mLer tUises \( <^«l i"»iV«t **** Given the points O, and the seeds □, run the if-means algorithm for three epochs. Draw the state of the algorithm at the beginning and after every epoch; no computation should be necessary. What is the output of the algorithm? [2 points] s * / * ]\ a J \ \ \ j \ s n \ J - 1) K / -—L A «,—„—, 4 11 --- s. / □ / ( J T 1 t. r ( V L J \ \J \ p v \ "V lit fen S 5- Perform a hierarchical clustering of the above dataset into three classes using the single-link hierarchical agglomerative clustering algorithm, and draw the resulting dendrogram. [1 point] Is the output the same as the output of the X-means flat hierarchical clustering algorithm above? [1 point] o (5. You may also write on the other side of the sheet, but it won't be scanned. :: i_| :: u n R1 ? q sheet l j l I uio L -j L- L W L—i L- L J—»j l—1 points l' 'j l." 'j State the Zip/'s /aw [1 point] and the Heaps' law [1 point] in its complete variant [1 point]. Let C denote the term-document matrix of a document collection that contains M unique terms. According to the Heaps' law, what is the number of documents N in the collection in relation to M? [3 points] tu su* o| C ?j m * n < t, ocA^e n is M x N < M1 => N < M. You may also write on the other side of the sheet, but it won't be scanned. □□□"I sheet 5 „&HnHnHH:I points You are the maintainer of a text retrieval system. Let E\ denote the complete set of documents in the index of your system and let E2 denote the complete set of documents in the index of a competing system. Suppose the indices of both systems are independent uniform random samples without replacement from the World Wide Web N. The size of Ei is \EX | = 130 trillion (130 • 1012) documents. You take a uniform random subsample of documents without replacement from E\ and you submit each document to the competing system. This gives you an unbiased estimate x — 0.2 of the conditional probability P(d € E2 \ d G E\),d G N. You repeat the same procedure with E2, obtaining an unbiased estimate y = 0.4 of the conditional probability P(d e E\ \ d € E2), d e N. Assume the estimates x, y are the true probabilities. What is the size \E% | of the competing system's index? [4 points] The grey parrot, native to equatorial Africa, is categorized as an endangered species by the International Union for Conservation of Nature (IUCN). Suppose you take a uniform random sample M without replacement of size |M| = 8 000 from the grey parrot population N and mark the sampled animals. After returning the marked animals back into the population, you take a second independent uniform random sample T without replacement of the same size |T| = 8 000 from the population. The number of marked animals R = M n T in the second sample is |jR| = 10. What is the size | of the population of the grey parrot? [4 points] N to* 1 HO-AO11- I0,h ini I n/ \ 1, \ imIiti m in 6-> tooo Write your solution only on this side of the sheet!