Advanced SearchTechniques for Large Scale Data Analytics
Pavel Zezula and Jan Sedmidubsky
Masaryk University
http://disa.fi.muni.cz
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 2
 The Cranfield Paradigm
 Retrieval Performance Evaluation
 Evaluation Using Reference Collections
 Interactive Systems Evaluation
 Search Log Analysis using Clickthrough Data
To evaluate an IR system is to measure how well the
system meets the information needs of the users
This is troublesome, given that a same result set might be
interpreted differently by distinct users
To deal with this problem, some metrics have been defined that,
on average, have a correlation with the preferences of a group of
users
Without proper retrieval evaluation, one cannot
determine how well the IR system is performing
compare the performance of the IR system with that of other
systems, objectively
Retrieval evaluation is a critical and integral
component of any modern IR system
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 3
Cleverdon obtained a grant from the National Science
Foundation to compare distinct indexing systems
These experiments provided interesting insights, that
culminated in the modern metrics of precision and recall
Recall ratio: the fraction of relevant documents retrieved
Precision ratio: the fraction of documents retrieved that are
relevant
For instance, it became clear that, in practicalsituations,
the majority of searches does not require high recall
Instead, the vast majority of the users require just a few
relevant answers
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 5
The next step was to devise a set of experiments that
would allow evaluating each indexing system in
isolation more thoroughly
The result was a test reference collection composed
of documents, queries, and relevance judgements
It became known as the Cranfield-2 collection
The reference collection allows using the same set of
documents and queries to evaluate different ranking
systems
The uniformity of this setup allows quick evaluation of
new ranking functions
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212) 6
Reference collections, which are based on the
foundations established by the Cranfield experiments,
constitute the most used evaluation method in IR
A reference collection is composed of:
A set D of pre-selected documents
A set I of information need descriptions used for testing
A set of relevance judgements associated with each pair [im, dj ],
im ∈ I and dj ∈D
The relevance judgement has a value of 0 if document
dj is non-relevant to im, and 1otherwise
These judgements are produced by human specialists
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Consider,
I: an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
R∩A: the intersection of the sets R and A
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The recall and precision measures are defined as
follows
Recall is the fraction of the relevant documents (the set R)
which has been retrieved i.e.,
Recall =
|R ∩ A|
|R|
Precision is the fraction of the
retrieved documents (the set
A) which is relevant i.e.,
Precision =
|R ∩ A|
|A|
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The definition of precision and recall assumes that all
docs in the set A have been examined
However, the user is not usually presented with all docs
in the answer set A at once
User sees a ranked set of documents and examines them
starting from the top
Thus, precision and recall vary as the user proceeds
with their examination of the set A
Most appropriate then is to plot a curve of precision
versus recall
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Consider a reference collection and a set of test queries
Let Rq1 be the set of relevant docs for a query q1:
Rq1 = { d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Consider a new IR algorithm that yields the following
answer to q1 (relevant docs are marked with a bullet):
01. d123 •
02. d84
03. d56 •
04. d6
05. d8
06. d9 •
07. d511
08. d129
09. d187
10. d25 •
11. d38
12. d48
13. d250
14. d113
15. d3 •
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
If we examine this ranking, we observe that
The document d123, ranked as number 1, is relevant
This document corresponds to 10% of all relevant documents
Thus, we say that we have a precision of 100% at 10% recall
The document d56, ranked as number 3, is the next relevant
At this point, two documents out of three are relevant, and two
of the ten relevant documents have been seen
Thus, we say that we have a precision of 66.6% at 20% recall
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
01. d123 •
02. d84
03. d56 •
04. d6
05. d8
06. d9 •
07. d511
08. d129
09. d187
10. d25 •
11. d38
12. d48
13. d250
14. d113
15. d3 •
If we proceed with our examination of the ranking
generated, we can plot a curve of precision versus
recall as follows:
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Consider now a second query q2 whose set of relevant
answers is given by
Rq 2 = { d3, d56, d129}
The previous IR algorithm processes the query q2and
returns a ranking, as follows
01. d425 06. d615 11. d193
02. d87 07. d512 12. d715
03. d56 • 08. d129 • 13. d810
04. d32 09. d4 14. d5
05. d124 10. d130 15. d3 •
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
If we examine this ranking, we observe
The first relevant document is d56
It provides a recall and precision levels equal to 33.3%
The second relevant document is d129
It provides a recall level of 66.6% (with precision equal to 25%)
The third relevant document is d3
It provides a recall level of 100% (with precision equal to 20%)
01. d425 06. d615 11. d193
02. d87 07. d512 12. d715
03. d56 • 08. d129 • 13. d810
04. d32 09. d4 14. d5
05. d124 10. d130 15. d3 •
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The precision figures at the 11 standard recall levels
are interpolated as follows
Let rj , j ∈ {0, 1, 2, . . . , 10}, be a reference to the j-th
standard recall level
Then, P (rj ) = max∀r |rj ≤r P(r)
In our last example, this interpolation rule yields the
precision and recall figures illustrated below
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
In the examples above, the precision and recall figures
have been computed for single queries
Usually, however, retrieval algorithms are evaluated by
running them for several distinct test queries
To evaluate the retrieval performance for Nq queries, we
average the precision at each recall level as follows
where
P(rj ) is the average precision at the recall level r j
Pi(rj ) is the precision at recall level r j for the i-th query
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
To illustrate, the figure below illustrates precision-recall
figures averaged over queries q1 and q2
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Average precision-recall curves are normally used to
compare the performance of distinct IR algorithms
The figure below illustrates average precision-recall
curves for two distinct retrieval algorithms
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
In the case of Web search engines, the majority of
searches does not require high recall
Higher the number of relevant documents at the top of
the ranking, more positive is the impression of the users
Precision at 5 (P@5) and at 10 (P@10) measure the
precision when 5 or 10 documents have been seen
These metrics assess whether the users are getting
relevant documents at the top of the ranking or not
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
To exemplify, consider again the ranking for the
example query q1 we have been using:
For this query, we have P@5 = 40% and P@10 = 40%
Further, we can compute P@5 and P@10 averaged
over a sample of 100 queries, for instance
These metrics provide an early assessment of which
algorithm might be preferable in the eyes of the users
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
01. d123 •
02. d84
03. d56 •
04. d6
05. d8
06. d9 •
07. d511
08. d129
09. d187
10. d25 •
11. d38
12. d48
13. d250
14. d113
15. d3 •
The idea here is to average the precision figures
obtained after each new relevant document is observed
For relevant documents not retrieved, the precision is set to 0
To illustrate, consider again the precision-recall curve
for the example query q1
The mean average precision (MAP) for q1 is given by
MAP1 =
1 + 0.66 + 0.5 + 0.4 + 0.33 + 0 + 0 + 0 + 0 + 0
10
= 0.28
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let R be the total number of relevant docs for a given
query
The idea here is to compute the precision at the R-th
position in the ranking
For the query q1, the R value is 10 and there are 4
relevants among the top 10 documents in the ranking
Thus, the R-Precision value for this query is 0.4
The R-precision measure is a useful for observing the
behavior of an algorithm for individual queries
Additionally, one can also compute an average
R-precision figure over a set of queries
However, using a single number to evaluate a algorithm over
several queries might be quite imprecise
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The R-precision computed for several queries can be
used to compare two algorithms as follows
Let,
RPA(i) : R-precision for algorithm A for the i-th query
RPB (i) : R-precision for algorithm B for the i-th query
Define, for instance, the difference
RPA/B(i) = RPA(i) − RPB (i)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Figure below illustrates the RPA/B (i) valuesfor two
retrieval algorithms over 10 example queries
The algorithm A performs better for 8 of the queries,
while the algorithm B performs better for the other 2
queries
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
MRR is a good metric for those cases in which we are
interested in the first correct answer such as
Question-Answering (QA) systems
Search engine queries that look for specific sites
URL queries
Homepage queries
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let,
Ri : ranking relative to a query qi
Scorrect(Ri): position of the first correct answer in R i
Sh: threshold for ranking position
Then, the reciprocal rank RR(Ri) for query qi is given by
The mean reciprocal rank (MRR) for a set Q of Nq
queries is given by
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
A measure that combines recall and precision
The idea is to allow the user to specify whether he is
more interested in recall or in precision
The E measure is defined as follows
where
r(j) is the recall at the j-th position in the ranking
P (j) is the precision at the j-th position in the ranking
b ≥ 0 is a user specified parameter
E(j) is the E metric at the j-th position in the ranking
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The parameter bis specified by the user and reflects the
relative importance of recall and precision
If b =0
E(j) = 1 − P(j)
low values of b make E(j) a function ofprecision
If b → ∞
limb→ ∞ E(j) = 1 − r(j)
high values of bmake E(j) a function of recal
For b =1, the E-measure becomes the F-measure
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The F-measure is also a single measure that combines
recall and precision
where
 r(j) is the recall at the j-th position in the ranking
 P (j)is the precision at the j-th position in the ranking
 F (j) is the harmonic mean at the j-th position in the ranking
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The function F assumes values in the interval [0, 1]
It is 0 when no relevant documents have been retrieved
and is 1 when all ranked documents are relevant
Further, the harmonic mean F assumes a high value
only when both recall and precision are high
To maximize F requires finding the best possible
compromise between recall and precision
Notice that setting b=1 in the formula of the E-measure
yields
F (j) = 1 − E(j)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Precision and recall allow only binary relevance
assessments
As a result, there is no distinction between highly
relevant docs and mildly relevant docs
These limitations can be overcome by adopting graded
relevance assessments and metrics that combine them
The discounted cumulated gain (DCG) is a metric
that combines graded relevance assessments
effectively
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
When examining the results of a query, two key
observations can be made:
highly relevant documents are preferable at the top of the ranking
than mildly relevant ones
relevant documents that appear at the end of the ranking are less
valuable
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Consider that the results of the queries are graded on a
scale 0–3 (0 for non-relevant, 3 for strong relevant docs)
For instance, for queries q1and q2, consider that the
graded relevance scores are as follows:
Rq1 = { [d3, 3], [d5, 3], [d9, 3], [d25, 2], [d39, 2],
[d44,2], [d56,1], [d71,1], [d89,1], [d123,1] }
Rq2 = { [d3, 3], [d56, 2], [d129, 1] }
That is, while document d3 is highly relevant to query q1,
document d56 is just mildly relevant
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Given these assessments, the results of a new ranking
algorithm can be evaluated as follows
Specialists associate a graded relevance score to the
top 10–20 results produced for a given query q
This list of relevance scores is referred to as the gain vector G
Considering the top 15 docs in the ranking produced for
queries q1 and q2, the gain vectors for these queries are:
G1 = (1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3)
G2 = (0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
By summing up the graded scores up to any point in the
ranking, we obtain the cumulated gain (CG)
For query q1, for instance, the cumulated gain at the first
position is 1, at the second position is 1+0, and so on
Thus, the cumulated gain vectors for queries q1 and q2
are given by
CG1 = (1, 1, 2, 2, 2, 5, 5, 5, 5, 7, 7, 7, 7, 7, 10)
CG2 = (0, 0, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 6)
For instance, the cumulated gain at position 8 of CG1 is
equal to 5
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
In formal terms, we define
Given the gain vector Gj for a test query qj , the CGj associated
with it is defined as
where CGj [i] refers to the cumulated gain at the i-th position of
the ranking for query qj
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
We also introduce a discount factor that reduces the
impact of the gain as we move upper in the ranking
A simple discount factor is the logarithm of the ranking
position
If we consider logs in base 2, this discount factor will be
log2 2 at position 2, log2 3 at position 3, and so on
By dividing a gain by the corresponding discount factor,
we obtain the discounted cumulated gain (DCG)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
More formally,
Given the gain vector Gj for a test query qj , the vector DCGj
associated with it is defined as
where DCGj [i] refers to the discounted cumulated gain at the i-th
position of the ranking for query qj
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
For the example queries q1and q2, the DCG vectors are
given by
DCG1 = (1.0, 1.0, 1.6, 1.6, 1.6, 2.8, 2.8, 2.8, 2.8, 3.4, 3.4, 3.4, 3.4, 3.4, 4.2)
DCG2 = (0.0, 0.0, 1.3, 1.3, 1.3, 1.3, 1.3, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 2.4)
Discounted cumulated gains are much less affected by
relevant documents at the end of the ranking
By adopting logs in higher bases the discount factor can
be accentuated
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
To produce CG and DCG curves over a set of test
queries, we need to average them over all queries
Given a set of Nq queries, average CG[i] and DCG[i]
over all queries are computed as follows
For instance, for the example queries q1and q2, these
averages are given by
CG = (0.5, 0.5, 2.0, 2.0, 2.0, 3.5, 3.5, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0)
DCG = (0.5, 0.5, 1.5, 1.5, 1.5, 2.1, 2.1, 2.2, 2.2, 2.5, 2.5, 2.5, 2.5, 2.5, 3.3)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Then, average curves can be drawn by varying the rank
positions from 1 to a pre-established threshold
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Recall and precision figures are computed relatively to
the set of relevant documents
CG and DCG scores, as defined above, are not
computed relatively to any baseline
This implies that it might be confusing to use them
directly to compare two distinct retrieval algorithms
One solution to this problem is to define a baseline to
be used for normalization
This baseline are the ideal CG and DCG metrics, as we
now discuss
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
For a given test query q, assume that the relevance
assessments made by the specialists produced:
n3 documents evaluated with a relevance score of 3
n2 documents evaluated with a relevance score of 2
n1 documents evaluated with a score of 1
n0 documents evaluated with a score of 0
The ideal gain vector IG is created by sorting all
relevance scores in decreasing order, as follows:
IG = (3, . .. , 3, 2, . .. , 2, 1, . .. , 1, 0, . .. , 0)
For instance, for the example queries q1 and q2, we have
IG1 = (3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0)
IG2 = (3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Ideal CG and ideal DCG vectors can be computed
analogously to the computations of CG and DCG
For the example queries q1 and q2, we have
ICG1 = (3,6,9,11,13,15,16,17,18,19,19,19,19,19,19)
ICG2 = (3,5,6,6,6,6,6,6,6,6,6,6,6,6,6)
The ideal DCG vectors are given by
IDCG1 = (3.0, 6.0, 7.9, 8.9, 9.8, 10.5, 10.9, 11.2, 11.5, 11.8, 11.8, 11.8, 11.8, 11.8, 11.8)
IDCG2 = (3.0, 5.0, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Further, average ICG and average IDCG scores can
be computed as follows
For instance, for the example queries q1 and q2, ICG
and IDCG vectors are given by
ICG = (3.0, 5.5, 7.5, 8.5, 9.5, 10.5, 11.0, 11.5, 12.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5)
IDCG = (3.0, 5.5, 6.8, 7.3, 7.7, 8.1, 8.3, 8.4, 8.6, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7)
By comparing the average CG and DCG curves for an
algorithm with the average ideal curves, we gain insight
on how much room for improvement there is
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Precision and recall figures can be directly compared to
the ideal curve of 100% precision at all recall levels
DCG figures, however, are not build relative to any ideal
curve, which makes it difficult to compare directly DCG
curves for two distinct ranking algorithms
This can be corrected by normalizing the DCG metric
Given a set of Nq test queries, normalized CG and DCG
metrics are given by
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
For instance, for the example queries q1and q2, NCG
and NDCG vectors are given by
NCG =
NDCG
(0.17, 0.09, 0.27, 0.24, 0.21, 0.33, 0.32,
0.35, 0.33, 0.40, 0.40, 0.40, 0.40, 0.40, 0.64)
= (0.17, 0.09, 0.21, 0.20, 0.19, 0.25, 0.25,
0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.38)
The area under the NCG and NDCG curves represent
the quality of the ranking algorithm
Higher the area, better the results are considered to be
Thus, normalized figures can be used to compare two
distinct ranking algorithms
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
CG and DCG metrics aim at taking into account
multiple level relevance assessments
This has the advantage of distinguishing highly relevant
documents from mildly relevant ones
The inherent disadvantages are that multiple level
relevance assessments are harder and more time
consuming to generate
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Despite these inherent difficulties, the CG and DCG
metrics present benefits:
They allow systematically combining document ranks and
relevance scores
Cumulated gain provides a single metric of retrieval performance
at any position in the ranking
It also stresses the gain produced by relevant docs up to a
position in the ranking, which makes the metrics more imune to
outliers
Further, discounted cumulated gain allows down weighting the
impact of relevant documents found late in the ranking
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Precision and recall allow comparing the relevance of
the results produced by two ranking functions
However, there are situations in which
we cannot directly measure relevance
we are more interested in determining how differently a ranking
function varies from a second one that we know well
In these cases, we are interested in comparing the
relative ordering produced by the two rankings
This can be accomplished by using statistical functions
called rank correlation metrics
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let rankings R1 and R2
A rank correlation metric yields a correlation coefficient
C(R1, R2) with the following properties:
−1 ≤ C(R1, R2 ) ≤ 1
if C(R1, R2 ) = 1, the agreement between the two rankings is
perfect i.e., they are the same.
if C(R1, R2 ) = −1, the disagreement between the two rankings is
perfect i.e., they are the reverse of each other.
if C(R1, R2 ) = 0, the two rankings are completely independent.
increasing values of C(R1, R2 ) imply increasing agreement
between the two rankings.
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The Spearman coefficient is likely the mostly used rank
correlation metric
It is based on the differences between the positions of a
same document in two rankings
Let
s1,j be the position of a document dj in ranking R1 and
s2,j be the position of dj in ranking R2
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Consider 10 example documents retrieved by two
distinct rankings R1 and R2. Let s1,j and s2,j be the
document position in these two rankings, as follows:
documents s1 ,j s2 ,j si , j − s2 , j (s1,j − s2 , j )2
d123 1 2 -1 1
d84 2 3 -1 1
d56 3 1 +2 4
d6 4 5 -1 1
d8 5 4 +1 1
d9 6 7 -1 1
d511 7 8 -1 1
d129 8 10 -2 4
d187 9 6 +3 9
d25 10 9 +1 1
Sum of Square Distances 24
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
By plotting the rank positions for R1 and R2 in a
2-dimensional coordinate system, we observe that
there is a strong correlation between the two rankings
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
To produce a quantitative assessment of this
correlation, we sum the squares of the differences for
each pair of rankings
If there are K documents ranked, the maximum value
for the sum of squares of ranking differences is given by
K × (K2 − 1)
3
Let K = 10
If the two rankings were in perfect disagreement, then this value
is (10 × (102 − 1))/ 3, or 330
On the other hand, if we have a complete agreement the sum is 0
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let us consider the fraction
Its value is
0 when the two rankings are in perfect agreement
+1 when they are in perfect disagreement
If we multiply the fraction by 2, its value shifts to the
range [0,+2]
If we now subtract the result from 1, the resultant value
shifts to the range [−1,+1]
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
This reasoning suggests defining the correlation
between the two rankings as follows
Let s1,j and s2,j be the positions of a document dj in two
rankings R1 and R2, respectively
Define
where
S(R1, R2 ) is the Spearman rank correlation coefficient
K indicates the size of the rankedsets
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
For the rankings in Figure below, we have
documents s1 , j s2 , j si , j − s2 , j (s1,j − s2 , j )2
d123 1 2 -1 1
d84 2 3 -1 1
d56 3 1 +2 4
d6 4 5 -1 1
d8 5 4 +1 1
d9 6 7 -1 1
d511 7 8 -1 1
d129 8 10 -2 4
d187 9 6 +3 9
d25 10 9 +1 1
Sum of Square Distances 24
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
It is difficult to assign an operational interpretation to
Spearman coefficient
One alternative is to use a coefficient that has a natural
and intuitive interpretation, as the Kendall Tau
coefficient
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
When we think of rank correlations, we think of how two
rankings tend to vary in similar ways
To illustrate, consider two documents dj and dk and
their positions in the rankings R1 and R2
Further, consider the differences in rank positions for
these two documents in each ranking, i.e.,
s1,k
s2,k
−
−
s1,j
s2,j
If these differences have the same sign, we say that the
document pair [dk, dj] is concordant in both rankings
If they have different signs, we say that the document
pair is discordant in the two rankings
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
for a total of 1 × 5 × 4, or 10 ordered pairs
Consider the top 5 documents in rankings R1 and R2
documents s1 , j s2 , j si , j − s2 , j
d123 1 2 -1
d84 2 3 -1
d56 3 1 +2
d6 4 5 -1
d8 5 4 +1
The ordered document pairs in ranking R1 are
[d123, d84], [d123, d56], [d123, d6], [d123, d8],
[d84, d56], [d84, d6], [d84,d8],
[d56, d6], [d56,d8],
[d6,d8]
2
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Repeating the same exercise for the top 5 documents in
ranking R2, we obtain
[d56,d123],
[d123, d84],
[d56,d84],
[d123, d8],
[d56, d8], [d56,d6],
[d123, d6],
[d84,d6],[d84,d8],
[d8,d6]
We compare these two sets of ordered pairs looking for
concordant and discordant pairs
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let us mark with a C the concordant pairs and with a D
the discordant pairs
For ranking R1, we have
C, D, C, C,
D, C, C,
C,C,
D
For ranking R2, we have
D, D, C, C,
C, C, C,
C,C,
D
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
That is, a total of 20, i.e., K (K − 1), ordered pairs are
produced jointly by the two rankings
Among these, 14 pairs are concordant and 6 pairs are
discordant
The Kendall Tau coefficient is defined as
τ(R1, R2) = P (R1 = R2) − P (R1 ≠ R2)
In our example
τ(R 1, R 2) =
14 6−
20 20
= 0.4
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Let,
∆(R1 , R2): number of discordant document pairs in R1 and R2
K(K − 1) − ∆(R1 , R2): number of concordant document pairs in
R1 and R2
Then,
P (R 1 = R 2) =
K (K − 1) − ∆(R1, R2)
P (R 1 ≠ R 2) =
K (K −1)
∆(R1, R2)
K (K −1)
which yields τ(R1, R2) = 1 − 2×∆ (R 1 ,R 2 )
K (K −1)
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
τ(R1, R2 ) = 1 −
5(5 − 1)
= 0.4
For the case of our previous example, we have
∆(R1 , R2 ) = 6
K = 5
Thus,
2 × 6
as before
The Kendall Tau coefficient is defined only for rankings
over a same set of elements
Most important, it has a simpler algebraic structure than
the Spearman coefficient
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
A form of evaluating two different systems is to evaluate
their results side by side
Typically, the top 10 results produced by the systems for
a given query are displayed in side-by-side panels
Presenting the results side by side allows controlling:
differences of opinion among subjects
influences on the user opinion produced by the ordering of the
top results
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
Side by side panels for Yahoo! and Google
Top 5 answers produced by each search engine, with regard to
the query “information retrieval evaluation”
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
The side-by-side experiment is simply a judgement on
which side provides better results for a given query
By recording the interactions of the users, we can infer which of
the answer sets are preferred to the query
Side by side panels can be used for quick comparison
of distinct search engines
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
There are a number of limitations with current
approaches for relevance evaluation
For instance, the Cranfield paradigm is expensive and
has obvious scalability issues
Recently, crowdsourcing has emerged as a feasible
alternative for relevance evaluation
Crowdsourcing is a term used to describe tasks that are
outsourced to a large group of people, called “workers”
It is an open call to solve a problem or carry out a task,
one which usually involves a monetary value in
exchange for such service
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)
To illustrate, crowdsourcing has been used to validate
research on the quality of search snippets
One of the most important aspects of crowdsourcing is
to design the experiment carefully
It is important to ask the right questions and to use
well-known usability techniques
Workers are not information retrieval experts, so the
task designer should provide clear instructions
Pavel Zezula, Jan Sedmidubsky. Advanced Search Techniques for Large Scale Data Analytics (PA212)