Modern Information Retrieval
Chapter 4: Retrieval Evaluation
(Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition)

Topics covered:
  The Cranfield Paradigm
  Retrieval Performance Evaluation
  Evaluation Using Reference Collections
  Interactive Systems Evaluation
  Search Log Analysis using Clickthrough Data

Introduction

To evaluate an IR system is to measure how well the system meets the information needs of its users
This is troublesome, given that the same result set might be interpreted differently by distinct users
To deal with this problem, metrics have been defined that, on average, correlate with the preferences of a group of users
Without proper retrieval evaluation, one cannot
  determine how well the IR system is performing
  compare the performance of the IR system with that of other systems, objectively
Retrieval evaluation is a critical and integral component of any modern IR system

The Cranfield Paradigm

Cleverdon obtained a grant from the National Science Foundation to compare distinct indexing systems
These experiments provided interesting insights, which culminated in the modern metrics of precision and recall
  recall ratio: the fraction of relevant documents retrieved
  precision ratio: the fraction of retrieved documents that are relevant
For instance, it became clear that, in practical situations, the majority of searches do not require high recall
Instead, the vast majority of users require just a few relevant answers

The next step was to devise a set of experiments that would allow evaluating each indexing system in isolation more thoroughly
The result was a test reference collection composed of documents, queries, and relevance judgements
It became known as the Cranfield-2 collection
The reference collection allows using the same set of documents and queries to evaluate different ranking systems
The uniformity of this setup allows quick evaluation of new ranking functions

Reference Collections

Reference collections, which are based on the foundations established by the Cranfield experiments, constitute the most widely used evaluation method in IR
A reference collection is composed of:
  a set D of pre-selected documents
  a set I of information need descriptions used for testing
  a set of relevance judgements associated with each pair [i_m, d_j], i_m ∈ I and d_j ∈ D
The relevance judgement has a value of 0 if document d_j is non-relevant to i_m, and 1 otherwise
These judgements are produced by human specialists
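To make the structure of a reference collection concrete, here is a minimal Python sketch. It is illustrative only: the dictionaries and the is_relevant helper are our own, not part of any standard collection format, and the data is toy data rather than Cranfield-2.

    # A minimal sketch of a test reference collection: documents, information
    # needs, and binary relevance judgements (toy data, not Cranfield-2).
    documents = {"d1": "text of document 1", "d2": "text of document 2"}
    information_needs = {"i1": "effects of boundary layer control on aircraft"}

    # Relevance judgements: 1 if the document is relevant to the information
    # need, 0 otherwise; pairs not listed are treated as non-relevant.
    judgements = {("i1", "d1"): 1, ("i1", "d2"): 0}

    def is_relevant(need_id: str, doc_id: str) -> int:
        """Return the binary relevance judgement for the pair [need, document]."""
        return judgements.get((need_id, doc_id), 0)

    print(is_relevant("i1", "d1"))  # 1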
Precision and Recall

Consider:
  I: an information request
  R: the set of relevant documents for I
  A: the answer set for I, generated by an IR system
  R ∩ A: the intersection of the sets R and A

The recall and precision measures are defined as follows
Recall is the fraction of the relevant documents (the set R) which has been retrieved:

    Recall = \frac{|R \cap A|}{|R|}

Precision is the fraction of the retrieved documents (the set A) which is relevant:

    Precision = \frac{|R \cap A|}{|A|}

This definition of precision and recall assumes that all docs in the set A have been examined
However, the user is not usually presented with all docs in the answer set A at once
  the user sees a ranked set of documents and examines them starting from the top
Thus, precision and recall vary as the user proceeds with the examination of the set A
Most appropriate, then, is to plot a curve of precision versus recall

Consider a reference collection and a set of test queries
Let Rq1 be the set of relevant docs for a query q1:
  Rq1 = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Consider a new IR algorithm that yields the following answer to q1 (relevant docs are marked with a bullet):

  01. d123 •   06. d9 •    11. d38
  02. d84      07. d511    12. d48
  03. d56 •    08. d129    13. d250
  04. d6       09. d187    14. d113
  05. d8       10. d25 •   15. d3 •

If we examine this ranking, we observe that
  the document d123, ranked as number 1, is relevant
    this document corresponds to 10% of all relevant documents
    thus, we say that we have a precision of 100% at 10% recall
  the document d56, ranked as number 3, is the next relevant one
    at this point, two documents out of three are relevant, and two of the ten relevant documents have been seen
    thus, we say that we have a precision of 66.6% at 20% recall
If we proceed with our examination of the ranking, we can plot a curve of precision versus recall

Consider now a second query q2 whose set of relevant answers is given by
  Rq2 = {d3, d56, d129}
The previous IR algorithm processes the query q2 and returns the following ranking:

  01. d425     06. d615    11. d193
  02. d87      07. d512    12. d715
  03. d56 •    08. d129 •  13. d810
  04. d32      09. d4      14. d5
  05. d124     10. d130    15. d3 •

If we examine this ranking, we observe that
  the first relevant document is d56
    it yields recall and precision levels both equal to 33.3%
  the second relevant document is d129
    it yields a recall level of 66.6% (with precision equal to 25%)
  the third relevant document is d3
    it yields a recall level of 100% (with precision equal to 20%)
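These precision and recall points can be reproduced with a short script. The sketch below is a minimal illustration using the q1 data above; the helper name precision_recall_points is ours, not the book's.

    def precision_recall_points(ranking, relevant):
        """Yield (recall, precision) each time a relevant document is seen."""
        relevant = set(relevant)
        found = 0
        points = []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                found += 1
                points.append((found / len(relevant), found / i))
        return points

    # Example query q1 from the text.
    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    relevant_q1 = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]

    for recall, precision in precision_recall_points(ranking_q1, relevant_q1):
        print(f"recall = {recall:.0%}, precision = {precision:.1%}")
    # recall = 10%, precision = 100.0%
    # recall = 20%, precision = 66.7%
    # ... and so on for the remaining relevant documents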
The precision figures at the 11 standard recall levels are interpolated as follows
Let r_j, j ∈ {0, 1, 2, ..., 10}, be a reference to the j-th standard recall level
Then,

    P(r_j) = \max_{\forall r \,|\, r_j \le r} P(r)

In our last example (query q2), this interpolation rule yields the precision figures at each of the 11 standard recall levels

In the examples above, the precision and recall figures have been computed for single queries
Usually, however, retrieval algorithms are evaluated by running them over several distinct test queries
To evaluate the retrieval performance over Nq queries, we average the precision at each recall level as follows

    \overline{P}(r_j) = \sum_{i=1}^{N_q} \frac{P_i(r_j)}{N_q}

where
  \overline{P}(r_j) is the average precision at the recall level r_j
  P_i(r_j) is the precision at recall level r_j for the i-th query

To illustrate, precision-recall figures can be averaged over the example queries q1 and q2
Average precision-recall curves are normally used to compare the performance of distinct IR algorithms, for instance by plotting the average curves of two distinct retrieval algorithms together

P@5 and P@10

In the case of Web search engines, the majority of searches do not require high recall
The higher the number of relevant documents at the top of the ranking, the more positive is the impression of the users
Precision at 5 (P@5) and precision at 10 (P@10) measure the precision when 5 or 10 documents have been seen
These metrics assess whether the users are getting relevant documents at the top of the ranking or not

To exemplify, consider again the ranking for the example query q1 above
For this query, we have P@5 = 40% and P@10 = 40%
Further, we can compute P@5 and P@10 averaged over a sample of 100 queries, for instance
These metrics provide an early assessment of which algorithm might be preferable in the eyes of the users

MAP: Mean Average Precision

The idea here is to average the precision figures obtained after each new relevant document is observed
  for relevant documents not retrieved, the precision is set to 0
Averaging these per-query figures over all test queries yields the mean average precision (MAP)
To illustrate, consider again the ranking for the example query q1, where the relevant documents appear at positions 1, 3, 6, 10, and 15
The average precision for q1 is given by

    MAP_1 = \frac{1 + 0.66 + 0.5 + 0.4 + 0.33 + 0 + 0 + 0 + 0 + 0}{10} = 0.28
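A minimal sketch of P@k and of the per-query average precision underlying MAP, using the q1 data above (function names are ours; the printed value differs from 0.28 only because the text truncates the intermediate precisions).

    def precision_at_k(ranking, relevant, k):
        """P@k: fraction of the top-k documents that are relevant."""
        relevant = set(relevant)
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    def average_precision(ranking, relevant):
        """Average of the precision values observed at each relevant document;
        relevant documents that are never retrieved contribute 0."""
        relevant = set(relevant)
        found = 0
        precisions = []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                found += 1
                precisions.append(found / i)
        return sum(precisions) / len(relevant)

    # MAP over a query set is simply the mean of the per-query values above.
    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    relevant_q1 = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]

    print(precision_at_k(ranking_q1, relevant_q1, 5))    # 0.4
    print(precision_at_k(ranking_q1, relevant_q1, 10))   # 0.4
    print(round(average_precision(ranking_q1, relevant_q1), 2))  # 0.29 (0.28 in the text)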
R-Precision

Let R be the total number of relevant docs for a given query
The idea here is to compute the precision at the R-th position in the ranking
For the query q1, the R value is 10 and there are 4 relevant documents among the top 10 in the ranking
  thus, the R-precision value for this query is 0.4
The R-precision measure is useful for observing the behavior of an algorithm on individual queries
Additionally, one can also compute an average R-precision figure over a set of queries
  however, using a single number to evaluate an algorithm over several queries might be quite imprecise

Precision Histograms

The R-precision values computed for several queries can be used to compare two algorithms as follows
Let
  RP_A(i): the R-precision of algorithm A for the i-th query
  RP_B(i): the R-precision of algorithm B for the i-th query
Define, for instance, the difference

    RP_{A/B}(i) = RP_A(i) - RP_B(i)

Plotting the RP_{A/B}(i) values for two retrieval algorithms over 10 example queries gives a precision histogram
In the book's example, algorithm A performs better for 8 of the queries, while algorithm B performs better for the other 2 queries

MRR: Mean Reciprocal Rank

MRR is a good metric for those cases in which we are interested in the first correct answer, such as
  Question-Answering (QA) systems
  search engine queries that look for specific sites
    URL queries
    homepage queries

Let
  R_i: the ranking relative to a query q_i
  S_correct(R_i): the position of the first correct answer in R_i
  S_h: a threshold on the ranking position
Then, the reciprocal rank RR(R_i) for query q_i is given by

    RR(R_i) = \begin{cases} \frac{1}{S_{correct}(R_i)} & \text{if } S_{correct}(R_i) \le S_h \\ 0 & \text{otherwise} \end{cases}

The mean reciprocal rank (MRR) for a set Q of N_q queries is given by

    MRR(Q) = \frac{1}{N_q} \sum_{i=1}^{N_q} RR(R_i)

The E-Measure

The E-measure is a single measure that combines recall and precision
The idea is to allow the user to specify whether they are more interested in recall or in precision
The E-measure is defined as follows

    E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

where
  r(j) is the recall at the j-th position in the ranking
  P(j) is the precision at the j-th position in the ranking
  b ≥ 0 is a user-specified parameter
  E(j) is the E metric at the j-th position in the ranking

The parameter b is specified by the user and reflects the relative importance of recall and precision
  if b = 0, then E(j) = 1 - P(j): low values of b make E(j) a function of precision
  if b → ∞, then lim_{b→∞} E(j) = 1 - r(j): high values of b make E(j) a function of recall
For b = 1, the E-measure becomes the complement of the F-measure, i.e., F(j) = 1 - E(j)
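The following sketch illustrates the reciprocal rank and the E-measure as defined above. It is a minimal illustration with helper names of our own; the numeric example for the E-measure uses arbitrary recall and precision values to show how b shifts the emphasis.

    def reciprocal_rank(ranking, correct, threshold=10):
        """1/position of the first correct answer, or 0 if none appears within the threshold."""
        for pos, doc in enumerate(ranking, start=1):
            if pos > threshold:
                break
            if doc in correct:
                return 1.0 / pos
        return 0.0

    def e_measure(recall, precision, b=1.0):
        """E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j)); assumes r(j) > 0 and P(j) > 0."""
        return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)

    print(reciprocal_rank(["d84", "d123", "d56"], {"d123"}))  # 0.5 (first correct at position 2)

    r, p = 0.4, 0.8
    print(round(e_measure(r, p, b=0.0), 3))     # 0.2   -> equals 1 - P
    print(round(e_measure(r, p, b=1000.0), 3))  # 0.6   -> approaches 1 - r
    print(round(e_measure(r, p, b=1.0), 3))     # 0.467 -> equals 1 - F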
F-Measure: Harmonic Mean

The F-measure is also a single measure that combines recall and precision

    F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}

where
  r(j) is the recall at the j-th position in the ranking
  P(j) is the precision at the j-th position in the ranking
  F(j) is the harmonic mean of r(j) and P(j), i.e., the F-measure at the j-th position in the ranking

The function F assumes values in the interval [0, 1]
  it is 0 when no relevant documents have been retrieved and 1 when all ranked documents are relevant
Further, the harmonic mean F assumes a high value only when both recall and precision are high
  to maximize F requires finding the best possible compromise between recall and precision
Notice that setting b = 1 in the formula of the E-measure yields F(j) = 1 - E(j)

DCG — Discounted Cumulated Gain

Precision and recall allow only binary relevance assessments
As a result, there is no distinction between highly relevant docs and mildly relevant docs
These limitations can be overcome by adopting graded relevance assessments and metrics that combine them
The discounted cumulated gain (DCG) is a metric that combines graded relevance assessments effectively

When examining the results of a query, two key observations can be made:
  highly relevant documents are preferable at the top of the ranking over mildly relevant ones
  relevant documents that appear at the end of the ranking are less valuable

Consider that the results of the queries are graded on a scale 0–3 (0 for non-relevant, 3 for strongly relevant docs)
For instance, for queries q1 and q2, consider that the graded relevance scores are as follows:
  Rq1 = { [d3, 3], [d5, 3], [d9, 3], [d25, 2], [d39, 2], [d44, 2], [d56, 1], [d71, 1], [d89, 1], [d123, 1] }
  Rq2 = { [d3, 3], [d56, 2], [d129, 1] }
That is, while document d3 is highly relevant to query q1, document d56 is just mildly relevant

Given these assessments, the results of a new ranking algorithm can be evaluated as follows
  specialists associate a graded relevance score to the top 10–20 results produced for a given query q
  this list of relevance scores is referred to as the gain vector G
Considering the top 15 docs in the rankings produced for queries q1 and q2, the gain vectors for these queries are:
  G1 = (1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3)
  G2 = (0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3)
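A gain vector can be built directly from the graded judgements and the ranking. The sketch below is a minimal illustration using the q1 data above; the gain_vector helper is ours.

    # Graded relevance judgements for q1 (unjudged documents get a grade of 0).
    grades_q1 = {"d3": 3, "d5": 3, "d9": 3, "d25": 2, "d39": 2, "d44": 2,
                 "d56": 1, "d71": 1, "d89": 1, "d123": 1}

    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    def gain_vector(ranking, grades):
        """Graded relevance score of each ranked document, in ranking order."""
        return [grades.get(doc, 0) for doc in ranking]

    print(gain_vector(ranking_q1, grades_q1))
    # [1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3]  -> matches G1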
By summing up the graded scores up to any point in the ranking, we obtain the cumulated gain (CG)
For query q1, for instance, the cumulated gain at the first position is 1, at the second position is 1 + 0, and so on
Thus, the cumulated gain vectors for queries q1 and q2 are given by
  CG1 = (1, 1, 2, 2, 2, 5, 5, 5, 5, 7, 7, 7, 7, 7, 10)
  CG2 = (0, 0, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 6)
For instance, the cumulated gain at position 8 of CG1 is equal to 5

In formal terms: given the gain vector G_j for a test query q_j, the cumulated gain vector CG_j associated with it is defined as

    CG_j[i] = \begin{cases} G_j[1] & \text{if } i = 1 \\ G_j[i] + CG_j[i-1] & \text{otherwise} \end{cases}

where CG_j[i] refers to the cumulated gain at the i-th position of the ranking for query q_j

We also introduce a discount factor that reduces the impact of the gain as we move down the ranking
A simple discount factor is the logarithm of the ranking position
  if we consider logs in base 2, this discount factor will be log2 2 at position 2, log2 3 at position 3, and so on
By dividing each gain by the corresponding discount factor, we obtain the discounted cumulated gain (DCG)

More formally: given the gain vector G_j for a test query q_j, the vector DCG_j associated with it is defined as

    DCG_j[i] = \begin{cases} G_j[1] & \text{if } i = 1 \\ \frac{G_j[i]}{\log_2 i} + DCG_j[i-1] & \text{otherwise} \end{cases}

where DCG_j[i] refers to the discounted cumulated gain at the i-th position of the ranking for query q_j

For the example queries q1 and q2, the DCG vectors are given by
  DCG1 = (1.0, 1.0, 1.6, 1.6, 1.6, 2.8, 2.8, 2.8, 2.8, 3.4, 3.4, 3.4, 3.4, 3.4, 4.2)
  DCG2 = (0.0, 0.0, 1.3, 1.3, 1.3, 1.3, 1.3, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 2.4)
Discounted cumulated gains are much less affected by relevant documents at the end of the ranking
By adopting logs in higher bases, the discount factor can be accentuated

DCG Curves

To produce CG and DCG curves over a set of test queries, we need to average them over all queries
Given a set of N_q queries, the average CG[i] and DCG[i] over all queries are computed as follows

    \overline{CG}[i] = \sum_{j=1}^{N_q} \frac{CG_j[i]}{N_q} ; \quad \overline{DCG}[i] = \sum_{j=1}^{N_q} \frac{DCG_j[i]}{N_q}

For instance, for the example queries q1 and q2, these averages are given by
  CG = (0.5, 0.5, 2.0, 2.0, 2.0, 3.5, 3.5, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0)
  DCG = (0.5, 0.5, 1.5, 1.5, 1.5, 2.1, 2.1, 2.2, 2.2, 2.5, 2.5, 2.5, 2.5, 2.5, 3.3)
Then, average curves can be drawn by varying the rank positions from 1 to a pre-established threshold
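The recursive definitions of CG and DCG translate directly into code. The following is a minimal sketch, assuming base-2 logarithms as in the text; helper names are ours.

    import math

    def cumulated_gain(gains):
        """CG[i]: running sum of the graded gains up to each ranking position."""
        cg, total = [], 0
        for g in gains:
            total += g
            cg.append(total)
        return cg

    def discounted_cumulated_gain(gains):
        """DCG[i]: like CG, but gains at positions i >= 2 are divided by log2(i)."""
        dcg, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i == 1 else g / math.log2(i)
            dcg.append(total)
        return dcg

    G1 = [1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3]
    print(cumulated_gain(G1))                                    # matches CG1 above
    print([round(x, 1) for x in discounted_cumulated_gain(G1)])  # matches DCG1 above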
Ideal CG and DCG Metrics

Recall and precision figures are computed relative to the set of relevant documents
CG and DCG scores, as defined above, are not computed relative to any baseline
  this implies that it might be confusing to use them directly to compare two distinct retrieval algorithms
One solution to this problem is to define a baseline to be used for normalization
This baseline is given by the ideal CG and DCG metrics, as we now discuss

For a given test query q, assume that the relevance assessments made by the specialists produced:
  n3 documents evaluated with a relevance score of 3
  n2 documents evaluated with a relevance score of 2
  n1 documents evaluated with a score of 1
  n0 documents evaluated with a score of 0
The ideal gain vector IG is created by sorting all relevance scores in decreasing order, as follows:
  IG = (3, ..., 3, 2, ..., 2, 1, ..., 1, 0, ..., 0)
For instance, for the example queries q1 and q2, we have
  IG1 = (3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0)
  IG2 = (3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Ideal CG and ideal DCG vectors can be computed analogously to the computations of CG and DCG
For the example queries q1 and q2, we have
  ICG1 = (3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, 19, 19)
  ICG2 = (3, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6)
The ideal DCG vectors are given by
  IDCG1 = (3.0, 6.0, 7.9, 8.9, 9.8, 10.5, 10.9, 11.2, 11.5, 11.8, 11.8, 11.8, 11.8, 11.8, 11.8)
  IDCG2 = (3.0, 5.0, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6)

Further, average ICG and average IDCG scores can be computed as follows

    \overline{ICG}[i] = \sum_{j=1}^{N_q} \frac{ICG_j[i]}{N_q} ; \quad \overline{IDCG}[i] = \sum_{j=1}^{N_q} \frac{IDCG_j[i]}{N_q}

For instance, for the example queries q1 and q2, the average ICG and IDCG vectors are given by
  ICG = (3.0, 5.5, 7.5, 8.5, 9.5, 10.5, 11.0, 11.5, 12.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5)
  IDCG = (3.0, 5.5, 6.8, 7.3, 7.7, 8.1, 8.3, 8.4, 8.6, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7)
By comparing the average CG and DCG curves for an algorithm with the average ideal curves, we gain insight into how much room for improvement there is

Normalized DCG

Precision and recall figures can be directly compared to the ideal curve of 100% precision at all recall levels
DCG figures, however, are not built relative to any ideal curve, which makes it difficult to directly compare DCG curves for two distinct ranking algorithms
This can be corrected by normalizing the DCG metric
Given a set of N_q test queries, the normalized CG and DCG metrics are given by

    \overline{NCG}[i] = \frac{\overline{CG}[i]}{\overline{ICG}[i]} ; \quad \overline{NDCG}[i] = \frac{\overline{DCG}[i]}{\overline{IDCG}[i]}
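A minimal sketch of this normalization for a single query (helper names are ours). Note that the ideal vector must be built from the grades of all judged documents for the query, which is why they are passed separately, and that the text averages DCG and IDCG across queries before dividing, whereas this sketch normalizes one query at a time.

    import math

    def dcg(gains):
        """Discounted cumulated gain vector for a list of graded gains."""
        out, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i == 1 else g / math.log2(i)
            out.append(total)
        return out

    def ndcg(gains, all_grades):
        """NDCG[i] = DCG[i] / IDCG[i]; the ideal gain vector sorts the grades of
        all judged documents in decreasing order, padded with zeros."""
        ideal = sorted(all_grades, reverse=True)
        ideal = (ideal + [0] * len(gains))[:len(gains)]
        actual, best = dcg(gains), dcg(ideal)
        return [a / b if b > 0 else 0.0 for a, b in zip(actual, best)]

    G2 = [0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3]
    grades_q2 = [3, 2, 1]  # grades of all judged relevant documents for q2
    print([round(x, 2) for x in ndcg(G2, grades_q2)])
    # starts [0.0, 0.0, 0.22, ...] and ends at 0.42; these are per-query values,
    # while the text's NDCG curve averages DCG and IDCG over q1 and q2 first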
For instance, for the example queries q1 and q2, the NCG and NDCG vectors are given by
  NCG = (0.17, 0.09, 0.27, 0.24, 0.21, 0.33, 0.32, 0.35, 0.33, 0.40, 0.40, 0.40, 0.40, 0.40, 0.64)
  NDCG = (0.17, 0.09, 0.21, 0.20, 0.19, 0.25, 0.25, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.38)
The area under the NCG and NDCG curves represents the quality of the ranking algorithm
  the higher the area, the better the results are considered to be
Thus, normalized figures can be used to compare two distinct ranking algorithms

Discussion on DCG Metrics

CG and DCG metrics aim at taking into account multiple-level relevance assessments
  this has the advantage of distinguishing highly relevant documents from mildly relevant ones
  the inherent disadvantage is that multiple-level relevance assessments are harder and more time consuming to generate
Despite these inherent difficulties, the CG and DCG metrics present benefits:
  they allow systematically combining document ranks and relevance scores
  cumulated gain provides a single metric of retrieval performance at any position in the ranking
  it also stresses the gain produced by relevant docs up to a position in the ranking, which makes the metrics more immune to outliers
  further, discounted cumulated gain allows down-weighting the impact of relevant documents found late in the ranking

Rank Correlation Metrics

Precision and recall allow comparing the relevance of the results produced by two ranking functions
However, there are situations in which
  we cannot directly measure relevance
  we are more interested in determining how differently a ranking function varies from a second one that we know well
In these cases, we are interested in comparing the relative ordering produced by the two rankings
This can be accomplished by using statistical functions called rank correlation metrics

Consider two rankings R1 and R2
A rank correlation metric yields a correlation coefficient C(R1, R2) with the following properties (illustrated by the sketch below):
  -1 ≤ C(R1, R2) ≤ 1
  if C(R1, R2) = 1, the agreement between the two rankings is perfect, i.e., they are the same
  if C(R1, R2) = -1, the disagreement between the two rankings is perfect, i.e., they are the reverse of each other
  if C(R1, R2) = 0, the two rankings are completely independent
  increasing values of C(R1, R2) imply increasing agreement between the two rankings
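These properties can be checked with off-the-shelf implementations. The sketch below uses SciPy's spearmanr and kendalltau on a toy example of document positions; the specific coefficients themselves are defined in the slides that follow.

    from scipy.stats import spearmanr, kendalltau

    r1 = [1, 2, 3, 4, 5]          # positions of 5 documents in ranking R1
    identical = [1, 2, 3, 4, 5]   # R2 equal to R1
    reverse = [5, 4, 3, 2, 1]     # R2 is the reverse of R1

    rho, _ = spearmanr(r1, identical)
    print(rho)                    # 1.0  -> perfect agreement
    rho, _ = spearmanr(r1, reverse)
    print(rho)                    # -1.0 -> perfect disagreement
    tau, _ = kendalltau(r1, reverse)
    print(tau)                    # -1.0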
The Spearman Coefficient

The Spearman coefficient is likely the most widely used rank correlation metric
It is based on the differences between the positions of the same document in two rankings
Let s1,j be the position of a document d_j in ranking R1 and s2,j be the position of d_j in ranking R2

Consider 10 example documents retrieved by two distinct rankings R1 and R2, with document positions s1,j and s2,j as follows:

  document    s1,j   s2,j   s1,j - s2,j   (s1,j - s2,j)^2
  d123          1      2        -1               1
  d84           2      3        -1               1
  d56           3      1        +2               4
  d6            4      5        -1               1
  d8            5      4        +1               1
  d9            6      7        -1               1
  d511          7      8        -1               1
  d129          8     10        -2               4
  d187          9      6        +3               9
  d25          10      9        +1               1
  Sum of squared differences                    24

By plotting the rank positions for R1 and R2 in a 2-dimensional coordinate system, we observe that there is a strong correlation between the two rankings

To produce a quantitative assessment of this correlation, we sum the squares of the differences for each pair of positions
If there are K documents ranked, the maximum value for the sum of squares of ranking differences is given by

    \frac{K (K^2 - 1)}{3}

Let K = 10
  if the two rankings were in perfect disagreement, this value is (10 × (10^2 - 1)) / 3, or 330
  on the other hand, if we have complete agreement the sum is 0

Let us consider the fraction

    \frac{\sum_{j=1}^{K} (s_{1,j} - s_{2,j})^2}{\frac{K (K^2 - 1)}{3}}

Its value is
  0 when the two rankings are in perfect agreement
  +1 when they are in perfect disagreement
If we multiply the fraction by 2, its value shifts to the range [0, +2]
If we now subtract the result from 1, the resultant value shifts to the range [-1, +1]

This reasoning suggests defining the correlation between the two rankings as follows
Let s1,j and s2,j be the positions of a document d_j in two rankings R1 and R2, respectively
Define

    S(R_1, R_2) = 1 - \frac{6 \sum_{j=1}^{K} (s_{1,j} - s_{2,j})^2}{K (K^2 - 1)}

where
  S(R1, R2) is the Spearman rank correlation coefficient
  K indicates the size of the ranked sets

For the rankings in the table above, we have

    S(R_1, R_2) = 1 - \frac{6 \times 24}{10 \times (10^2 - 1)} = 1 - \frac{144}{990} = 0.854
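The Spearman formula above is straightforward to implement. A minimal sketch using the position vectors from the table; the spearman helper is ours.

    def spearman(positions_1, positions_2):
        """Spearman coefficient: 1 - 6 * (sum of squared position differences) / (K (K^2 - 1))."""
        k = len(positions_1)
        sq = sum((a - b) ** 2 for a, b in zip(positions_1, positions_2))
        return 1 - (6 * sq) / (k * (k ** 2 - 1))

    # Positions of the 10 example documents in rankings R1 and R2 (table above).
    s1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    s2 = [2, 3, 1, 5, 4, 7, 8, 10, 6, 9]

    print(round(spearman(s1, s2), 3))  # 0.855 (0.854 in the text, which truncates)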
The Kendall Tau Coefficient

It is difficult to assign an operational interpretation to the Spearman coefficient
One alternative is to use a coefficient that has a natural and intuitive interpretation, such as the Kendall Tau coefficient

When we think of rank correlations, we think of how two rankings tend to vary in similar ways
To illustrate, consider two documents d_j and d_k and their positions in the rankings R1 and R2
Further, consider the differences in rank positions for these two documents in each ranking, i.e.,
  s1,k - s1,j
  s2,k - s2,j
If these differences have the same sign, we say that the document pair [d_k, d_j] is concordant in both rankings
If they have different signs, we say that the document pair is discordant in the two rankings

Consider the top 5 documents in rankings R1 and R2:

  document    s1,j   s2,j   s1,j - s2,j
  d123          1      2        -1
  d84           2      3        -1
  d56           3      1        +2
  d6            4      5        -1
  d8            5      4        +1

The ordered document pairs in ranking R1 are
  [d123, d84], [d123, d56], [d123, d6], [d123, d8], [d84, d56],
  [d84, d6], [d84, d8], [d56, d6], [d56, d8], [d6, d8]
for a total of (1/2) × 5 × 4, or 10 ordered pairs
Repeating the same exercise for the top 5 documents in ranking R2, we obtain
  [d56, d123], [d56, d84], [d56, d8], [d56, d6], [d123, d84],
  [d123, d8], [d123, d6], [d84, d8], [d84, d6], [d8, d6]
We compare these two sets of ordered pairs looking for concordant and discordant pairs

Let us mark with a C the concordant pairs and with a D the discordant pairs
  for ranking R1, we have: C, D, C, C, D, C, C, C, C, D
  for ranking R2, we have: D, D, C, C, C, C, C, C, C, D
That is, a total of 20, i.e., K(K - 1), ordered pairs are produced jointly by the two rankings
Among these, 14 pairs are concordant and 6 pairs are discordant

The Kendall Tau coefficient is defined as the probability that a pair is concordant minus the probability that it is discordant:

    \tau(R_1, R_2) = P(\text{concordant}) - P(\text{discordant})

In our example

    \tau(R_1, R_2) = \frac{14}{20} - \frac{6}{20} = 0.4

Let
  Δ(R1, R2): the number of discordant document pairs in R1 and R2
  K(K - 1) - Δ(R1, R2): the number of concordant document pairs in R1 and R2
Then,

    P(\text{concordant}) = \frac{K(K - 1) - \Delta(R_1, R_2)}{K(K - 1)} ; \quad P(\text{discordant}) = \frac{\Delta(R_1, R_2)}{K(K - 1)}

which yields

    \tau(R_1, R_2) = 1 - \frac{2 \Delta(R_1, R_2)}{K(K - 1)}

For the case of our previous example, we have
  Δ(R1, R2) = 6
  K = 5
Thus,

    \tau(R_1, R_2) = 1 - \frac{2 \times 6}{5 \times (5 - 1)} = 0.4

as before
The Kendall Tau coefficient is defined only for rankings over the same set of elements
Most important, it has a simpler algebraic structure than the Spearman coefficient
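A minimal sketch of the Kendall Tau computation under the ordered-pair counting used above. The kendall_tau helper is ours; it counts unordered discordant pairs and doubles them to match the K(K - 1) ordered pairs of the text.

    from itertools import combinations

    def kendall_tau(positions_1, positions_2):
        """tau = 1 - 2 * Delta / (K (K - 1)), where Delta counts discordant ordered pairs."""
        k = len(positions_1)
        discordant = 0
        for i, j in combinations(range(k), 2):
            d1 = positions_1[i] - positions_1[j]
            d2 = positions_2[i] - positions_2[j]
            if d1 * d2 < 0:   # opposite signs -> the pair is discordant
                discordant += 1
        # Each unordered discordant pair appears twice among the K(K-1) ordered pairs.
        delta = 2 * discordant
        return 1 - (2 * delta) / (k * (k - 1))

    # Top 5 documents of the example: positions in R1 and R2.
    s1 = [1, 2, 3, 4, 5]
    s2 = [2, 3, 1, 5, 4]

    print(kendall_tau(s1, s2))  # 0.4, as in the text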
Side-by-Side Panels

A form of evaluating two different systems is to evaluate their results side by side
Typically, the top 10 results produced by the systems for a given query are displayed in side-by-side panels
Presenting the results side by side allows controlling:
  differences of opinion among subjects
  influences on the user's opinion produced by the ordering of the top results

As an example, consider side-by-side panels for Yahoo! and Google showing the top 5 answers produced by each search engine for the query "information retrieval evaluation"
The side-by-side experiment is simply a judgement on which side provides better results for a given query
By recording the interactions of the users, we can infer which of the answer sets is preferred for the query
Side-by-side panels can be used for quick comparison of distinct search engines

A/B Testing & Crowdsourcing

There are a number of limitations with current approaches for relevance evaluation
  for instance, the Cranfield paradigm is expensive and has obvious scalability issues
Recently, crowdsourcing has emerged as a feasible alternative for relevance evaluation
Crowdsourcing is a term used to describe tasks that are outsourced to a large group of people, called "workers"
It is an open call to solve a problem or carry out a task, one which usually involves a monetary reward in exchange for the service

To illustrate, crowdsourcing has been used to validate research on the quality of search snippets
One of the most important aspects of crowdsourcing is to design the experiment carefully
  it is important to ask the right questions and to use well-known usability techniques
  workers are not information retrieval experts, so the task designer should provide clear instructions