Advanced SearchTechniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz  Suppose our input data to a map-reduce operation consists of integer values (the keys are not important) ▪ The map function takes an integer i and produces pairs (p, i) such that p is a prime divisor of i ▪ Example: map(12) = [(2,12), (3,12)] ▪ The reduce function is addition ▪ Example: reduce(p, [i , i , ..., i]) is (p, i + i + ... + i)  Compute the output, if the input is the set of integers 15, 21, 24, 30, 49 Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2  Suppose we have the following relations R, S: R S  Apply the natural join algorithm ▪ Apply the Map function to the tuples of relations ▪ Construct the elements that are input to the Reduce function Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3 A B 0 1 1 2 2 3 B C 0 1 1 2 2 3  Design MapReduce algorithms that take a very large file of integers and produce as output: 1) The largest integer; 2) The average of all the integers; 3) The same set of integers, but with each integer appearing only once; 4) The count of the number of distinct integers in the input. Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4  The algorithm retrieves the six most relevant documents for each query. We focus on the first relevant document retrieved 1) Determine a convenient measure for this task 2) Compute the measure on the following four query rankings with relevant/irrelevant objects: ▪ R1 = {d7, d5, d3, d8, d1} ▪ R2 = {d5, d6, d3, d2, d4} ▪ R3 = {d9, d3, d4, d8, d5} ▪ R4 = {d9, d3, d1, d7, d5} 3) How can be the result value interpreted? Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5  Assume the following two rankings of documents (for some query): ▪ R1 = {d7, d5, d3, d8, d1} ▪ R2 = {d5, d8, d3, d1, d7}  Based on these rankings compute: ▪ Spearman rank correlation coefficient ▪ Kendall Tau coefficient Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6  The SSE (sum squared error) is a common measure of the quality of a cluster ▪ sum of the squares of the distances between each of the points of the cluster and the centroid  Sometimes, we decide to split a cluster in order to reduce the SSE ▪ Suppose a cluster consists of the following three points: (9,5), (2,2) and (4,8) ▪ Calculate the reduction in the SSE if we partition the cluster optimally into two clusters Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7  Perform a hierarchical clustering on points A–F ▪ Using the centroid proximity measure ▪ There is a tie for which pair of clusters is closest. Follow both choices and identify the clusters Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8 E [5, 27] [33, 33] [21, 21] [28, 6] [10, 10] [0, 0] C D B A F