Advanced SearchTechniques for Large Scale Data Analytics
Pavel Zezula and Jan Sedmidubsky
Masaryk University
http://disa.fi.muni.cz
 Suppose our input data to a map-reduce
operation consists of integer values (the keys
are not important)
▪ The map function takes an integer i and produces
pairs (p, i) such that p is a prime divisor of i
▪ Example: map(12) = [(2,12), (3,12)]
▪ The reduce function is addition
▪ Example: reduce(p, [i , i , ..., i]) is (p, i + i + ... + i)
 Compute the output, if the input is the set of
integers 15, 21, 24, 30, 49
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2
 Suppose we have the following relations R, S:
R S
 Apply the natural join algorithm
▪ Apply the Map function to the tuples of relations
▪ Construct the elements that are input to the Reduce
function
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3
A B
0 1
1 2
2 3
B C
0 1
1 2
2 3
 Design MapReduce algorithms that take a
very large file of integers and produce as
output:
1) The largest integer;
2) The average of all the integers;
3) The same set of integers, but with each integer
appearing only once;
4) The count of the number of distinct integers in
the input.
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 4
 The algorithm retrieves the six most relevant
documents for each query. We focus on the
first relevant document retrieved
1) Determine a convenient measure for this task
2) Compute the measure on the following four
query rankings with relevant/irrelevant objects:
▪ R1 = {d7, d5, d3, d8, d1}
▪ R2 = {d5, d6, d3, d2, d4}
▪ R3 = {d9, d3, d4, d8, d5}
▪ R4 = {d9, d3, d1, d7, d5}
3) How can be the result value interpreted?
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5
 Assume the following two rankings of
documents (for some query):
▪ R1 = {d7, d5, d3, d8, d1}
▪ R2 = {d5, d8, d3, d1, d7}
 Based on these rankings compute:
▪ Spearman rank correlation coefficient
▪ Kendall Tau coefficient
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6
 The SSE (sum squared error) is a common
measure of the quality of a cluster
▪ sum of the squares of the distances between each
of the points of the cluster and the centroid
 Sometimes, we decide to split a cluster in
order to reduce the SSE
▪ Suppose a cluster consists of the following three
points: (9,5), (2,2) and (4,8)
▪ Calculate the reduction in the SSE if we partition
the cluster optimally into two clusters
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 7
 Perform a hierarchical clustering on points A–F
▪ Using the centroid proximity measure
▪ There is a tie for which pair of clusters is closest.
Follow both choices and identify the clusters
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 8
E [5, 27]
[33, 33]
[21, 21]
[28, 6]
[10, 10]
[0, 0]
C
D
B
A
F