Modern Information Retrieval
Chapter 4: Retrieval Evaluation
(Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition)

Topics covered:
  The Cranfield Paradigm
  Retrieval Performance Evaluation
  Evaluation Using Reference Collections
  Interactive Systems Evaluation
  Search Log Analysis using Clickthrough Data

Introduction

To evaluate an IR system is to measure how well the system meets the information needs of its users
This is troublesome, given that the same result set might be interpreted differently by distinct users
To deal with this problem, metrics have been defined that, on average, correlate with the preferences of a group of users
Without proper retrieval evaluation, one cannot
  determine how well the IR system is performing
  compare the performance of the IR system with that of other systems, objectively
Retrieval evaluation is a critical and integral component of any modern IR system

The Cranfield Paradigm

Cleverdon obtained a grant from the National Science Foundation to compare distinct indexing systems
These experiments provided interesting insights, which culminated in the modern metrics of precision and recall
  recall ratio: the fraction of relevant documents retrieved
  precision ratio: the fraction of retrieved documents that are relevant
For instance, it became clear that, in practical situations, the majority of searches do not require high recall
Instead, the vast majority of users require just a few relevant answers

The next step was to devise a set of experiments that would allow evaluating each indexing system in isolation more thoroughly
The result was a test reference collection composed of documents, queries, and relevance judgements
It became known as the Cranfield-2 collection
The reference collection allows using the same set of documents and queries to evaluate different ranking systems
The uniformity of this setup allows quick evaluation of new ranking functions

Reference Collections

Reference collections, which are based on the foundations established by the Cranfield experiments, constitute the most widely used evaluation method in IR
A reference collection is composed of:
  a set D of pre-selected documents
  a set I of information need descriptions used for testing
  a set of relevance judgements associated with each pair [i_m, d_j], i_m ∈ I and d_j ∈ D
The relevance judgement has a value of 0 if document d_j is non-relevant to i_m, and 1 otherwise
These judgements are produced by human specialists
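To make the structure of a reference collection concrete, here is a minimal Python sketch. It is illustrative only: the dictionaries and the is_relevant helper are our own, not part of any standard collection format, and the data is toy data rather than Cranfield-2.

    # A minimal sketch of a test reference collection: documents, information
    # needs, and binary relevance judgements (toy data, not Cranfield-2).
    documents = {"d1": "text of document 1", "d2": "text of document 2"}
    information_needs = {"i1": "effects of boundary layer control on aircraft"}

    # Relevance judgements: 1 if the document is relevant to the information
    # need, 0 otherwise; pairs not listed are treated as non-relevant.
    judgements = {("i1", "d1"): 1, ("i1", "d2"): 0}

    def is_relevant(need_id: str, doc_id: str) -> int:
        """Return the binary relevance judgement for the pair [need, document]."""
        return judgements.get((need_id, doc_id), 0)

    print(is_relevant("i1", "d1"))  # 1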
Precision and Recall

Consider:
  I: an information request
  R: the set of relevant documents for I
  A: the answer set for I, generated by an IR system
  R ∩ A: the intersection of the sets R and A

The recall and precision measures are defined as follows
Recall is the fraction of the relevant documents (the set R) which has been retrieved:

    Recall = \frac{|R \cap A|}{|R|}

Precision is the fraction of the retrieved documents (the set A) which is relevant:

    Precision = \frac{|R \cap A|}{|A|}

This definition of precision and recall assumes that all docs in the set A have been examined
However, the user is not usually presented with all docs in the answer set A at once
  the user sees a ranked set of documents and examines them starting from the top
Thus, precision and recall vary as the user proceeds with the examination of the set A
Most appropriate, then, is to plot a curve of precision versus recall

Consider a reference collection and a set of test queries
Let Rq1 be the set of relevant docs for a query q1:
  Rq1 = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Consider a new IR algorithm that yields the following answer to q1 (relevant docs are marked with a bullet):

  01. d123 •   06. d9 •    11. d38
  02. d84      07. d511    12. d48
  03. d56 •    08. d129    13. d250
  04. d6       09. d187    14. d113
  05. d8       10. d25 •   15. d3 •

If we examine this ranking, we observe that
  the document d123, ranked as number 1, is relevant
    this document corresponds to 10% of all relevant documents
    thus, we say that we have a precision of 100% at 10% recall
  the document d56, ranked as number 3, is the next relevant one
    at this point, two documents out of three are relevant, and two of the ten relevant documents have been seen
    thus, we say that we have a precision of 66.6% at 20% recall
If we proceed with our examination of the ranking, we can plot a curve of precision versus recall

Consider now a second query q2 whose set of relevant answers is given by
  Rq2 = {d3, d56, d129}
The previous IR algorithm processes the query q2 and returns the following ranking:

  01. d425     06. d615    11. d193
  02. d87      07. d512    12. d715
  03. d56 •    08. d129 •  13. d810
  04. d32      09. d4      14. d5
  05. d124     10. d130    15. d3 •

If we examine this ranking, we observe that
  the first relevant document is d56
    it yields recall and precision levels both equal to 33.3%
  the second relevant document is d129
    it yields a recall level of 66.6% (with precision equal to 25%)
  the third relevant document is d3
    it yields a recall level of 100% (with precision equal to 20%)
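These precision and recall points can be reproduced with a short script. The sketch below is a minimal illustration using the q1 data above; the helper name precision_recall_points is ours, not the book's.

    def precision_recall_points(ranking, relevant):
        """Yield (recall, precision) each time a relevant document is seen."""
        relevant = set(relevant)
        found = 0
        points = []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                found += 1
                points.append((found / len(relevant), found / i))
        return points

    # Example query q1 from the text.
    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    relevant_q1 = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]

    for recall, precision in precision_recall_points(ranking_q1, relevant_q1):
        print(f"recall = {recall:.0%}, precision = {precision:.1%}")
    # recall = 10%, precision = 100.0%
    # recall = 20%, precision = 66.7%
    # ... and so on for the remaining relevant documents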
The precision figures at the 11 standard recall levels are interpolated as follows
Let r_j, j ∈ {0, 1, 2, ..., 10}, be a reference to the j-th standard recall level
Then,

    P(r_j) = \max_{\forall r \,|\, r_j \le r} P(r)

In our last example (query q2), this interpolation rule yields the precision figures at each of the 11 standard recall levels

In the examples above, the precision and recall figures have been computed for single queries
Usually, however, retrieval algorithms are evaluated by running them over several distinct test queries
To evaluate the retrieval performance over Nq queries, we average the precision at each recall level as follows

    \overline{P}(r_j) = \sum_{i=1}^{N_q} \frac{P_i(r_j)}{N_q}

where
  \overline{P}(r_j) is the average precision at the recall level r_j
  P_i(r_j) is the precision at recall level r_j for the i-th query

To illustrate, precision-recall figures can be averaged over the example queries q1 and q2
Average precision-recall curves are normally used to compare the performance of distinct IR algorithms, for instance by plotting the average curves of two distinct retrieval algorithms together

P@5 and P@10

In the case of Web search engines, the majority of searches do not require high recall
The higher the number of relevant documents at the top of the ranking, the more positive is the impression of the users
Precision at 5 (P@5) and precision at 10 (P@10) measure the precision when 5 or 10 documents have been seen
These metrics assess whether the users are getting relevant documents at the top of the ranking or not

To exemplify, consider again the ranking for the example query q1 above
For this query, we have P@5 = 40% and P@10 = 40%
Further, we can compute P@5 and P@10 averaged over a sample of 100 queries, for instance
These metrics provide an early assessment of which algorithm might be preferable in the eyes of the users

MAP: Mean Average Precision

The idea here is to average the precision figures obtained after each new relevant document is observed
  for relevant documents not retrieved, the precision is set to 0
Averaging these per-query figures over all test queries yields the mean average precision (MAP)
To illustrate, consider again the ranking for the example query q1, where the relevant documents appear at positions 1, 3, 6, 10, and 15
The average precision for q1 is given by

    MAP_1 = \frac{1 + 0.66 + 0.5 + 0.4 + 0.33 + 0 + 0 + 0 + 0 + 0}{10} = 0.28
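A minimal sketch of P@k and of the per-query average precision underlying MAP, using the q1 data above (function names are ours; the printed value differs from 0.28 only because the text truncates the intermediate precisions).

    def precision_at_k(ranking, relevant, k):
        """P@k: fraction of the top-k documents that are relevant."""
        relevant = set(relevant)
        return sum(1 for doc in ranking[:k] if doc in relevant) / k

    def average_precision(ranking, relevant):
        """Average of the precision values observed at each relevant document;
        relevant documents that are never retrieved contribute 0."""
        relevant = set(relevant)
        found = 0
        precisions = []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                found += 1
                precisions.append(found / i)
        return sum(precisions) / len(relevant)

    # MAP over a query set is simply the mean of the per-query values above.
    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
    relevant_q1 = ["d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"]

    print(precision_at_k(ranking_q1, relevant_q1, 5))    # 0.4
    print(precision_at_k(ranking_q1, relevant_q1, 10))   # 0.4
    print(round(average_precision(ranking_q1, relevant_q1), 2))  # 0.29 (0.28 in the text)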
R-Precision

Let R be the total number of relevant docs for a given query
The idea here is to compute the precision at the R-th position in the ranking
For the query q1, the R value is 10 and there are 4 relevant documents among the top 10 in the ranking
  thus, the R-precision value for this query is 0.4
The R-precision measure is useful for observing the behavior of an algorithm on individual queries
Additionally, one can also compute an average R-precision figure over a set of queries
  however, using a single number to evaluate an algorithm over several queries might be quite imprecise

Precision Histograms

The R-precision values computed for several queries can be used to compare two algorithms as follows
Let
  RP_A(i): the R-precision of algorithm A for the i-th query
  RP_B(i): the R-precision of algorithm B for the i-th query
Define, for instance, the difference

    RP_{A/B}(i) = RP_A(i) - RP_B(i)

Plotting the RP_{A/B}(i) values for two retrieval algorithms over 10 example queries gives a precision histogram
In the book's example, algorithm A performs better for 8 of the queries, while algorithm B performs better for the other 2 queries

MRR: Mean Reciprocal Rank

MRR is a good metric for those cases in which we are interested in the first correct answer, such as
  Question-Answering (QA) systems
  search engine queries that look for specific sites
    URL queries
    homepage queries

Let
  R_i: the ranking relative to a query q_i
  S_correct(R_i): the position of the first correct answer in R_i
  S_h: a threshold on the ranking position
Then, the reciprocal rank RR(R_i) for query q_i is given by

    RR(R_i) = \begin{cases} \frac{1}{S_{correct}(R_i)} & \text{if } S_{correct}(R_i) \le S_h \\ 0 & \text{otherwise} \end{cases}

The mean reciprocal rank (MRR) for a set Q of N_q queries is given by

    MRR(Q) = \frac{1}{N_q} \sum_{i=1}^{N_q} RR(R_i)

The E-Measure

The E-measure is a single measure that combines recall and precision
The idea is to allow the user to specify whether they are more interested in recall or in precision
The E-measure is defined as follows

    E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

where
  r(j) is the recall at the j-th position in the ranking
  P(j) is the precision at the j-th position in the ranking
  b ≥ 0 is a user-specified parameter
  E(j) is the E metric at the j-th position in the ranking

The parameter b is specified by the user and reflects the relative importance of recall and precision
  if b = 0, then E(j) = 1 - P(j): low values of b make E(j) a function of precision
  if b → ∞, then lim_{b→∞} E(j) = 1 - r(j): high values of b make E(j) a function of recall
For b = 1, the E-measure becomes the complement of the F-measure, i.e., F(j) = 1 - E(j)
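The following sketch illustrates the reciprocal rank and the E-measure as defined above. It is a minimal illustration with helper names of our own; the numeric example for the E-measure uses arbitrary recall and precision values to show how b shifts the emphasis.

    def reciprocal_rank(ranking, correct, threshold=10):
        """1/position of the first correct answer, or 0 if none appears within the threshold."""
        for pos, doc in enumerate(ranking, start=1):
            if pos > threshold:
                break
            if doc in correct:
                return 1.0 / pos
        return 0.0

    def e_measure(recall, precision, b=1.0):
        """E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j)); assumes r(j) > 0 and P(j) > 0."""
        return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)

    print(reciprocal_rank(["d84", "d123", "d56"], {"d123"}))  # 0.5 (first correct at position 2)

    r, p = 0.4, 0.8
    print(round(e_measure(r, p, b=0.0), 3))     # 0.2   -> equals 1 - P
    print(round(e_measure(r, p, b=1000.0), 3))  # 0.6   -> approaches 1 - r
    print(round(e_measure(r, p, b=1.0), 3))     # 0.467 -> equals 1 - F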
F-Measure: Harmonic Mean

The F-measure is also a single measure that combines recall and precision

    F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}

where
  r(j) is the recall at the j-th position in the ranking
  P(j) is the precision at the j-th position in the ranking
  F(j) is the harmonic mean of r(j) and P(j), i.e., the F-measure at the j-th position in the ranking

The function F assumes values in the interval [0, 1]
  it is 0 when no relevant documents have been retrieved and 1 when all ranked documents are relevant
Further, the harmonic mean F assumes a high value only when both recall and precision are high
  to maximize F requires finding the best possible compromise between recall and precision
Notice that setting b = 1 in the formula of the E-measure yields F(j) = 1 - E(j)

DCG — Discounted Cumulated Gain

Precision and recall allow only binary relevance assessments
As a result, there is no distinction between highly relevant docs and mildly relevant docs
These limitations can be overcome by adopting graded relevance assessments and metrics that combine them
The discounted cumulated gain (DCG) is a metric that combines graded relevance assessments effectively

When examining the results of a query, two key observations can be made:
  highly relevant documents are preferable at the top of the ranking over mildly relevant ones
  relevant documents that appear at the end of the ranking are less valuable

Consider that the results of the queries are graded on a scale 0–3 (0 for non-relevant, 3 for strongly relevant docs)
For instance, for queries q1 and q2, consider that the graded relevance scores are as follows:
  Rq1 = { [d3, 3], [d5, 3], [d9, 3], [d25, 2], [d39, 2], [d44, 2], [d56, 1], [d71, 1], [d89, 1], [d123, 1] }
  Rq2 = { [d3, 3], [d56, 2], [d129, 1] }
That is, while document d3 is highly relevant to query q1, document d56 is just mildly relevant

Given these assessments, the results of a new ranking algorithm can be evaluated as follows
  specialists associate a graded relevance score to the top 10–20 results produced for a given query q
  this list of relevance scores is referred to as the gain vector G
Considering the top 15 docs in the rankings produced for queries q1 and q2, the gain vectors for these queries are:
  G1 = (1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3)
  G2 = (0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3)
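A gain vector can be built directly from the graded judgements and the ranking. The sketch below is a minimal illustration using the q1 data above; the gain_vector helper is ours.

    # Graded relevance judgements for q1 (unjudged documents get a grade of 0).
    grades_q1 = {"d3": 3, "d5": 3, "d9": 3, "d25": 2, "d39": 2, "d44": 2,
                 "d56": 1, "d71": 1, "d89": 1, "d123": 1}

    ranking_q1 = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
                  "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

    def gain_vector(ranking, grades):
        """Graded relevance score of each ranked document, in ranking order."""
        return [grades.get(doc, 0) for doc in ranking]

    print(gain_vector(ranking_q1, grades_q1))
    # [1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3]  -> matches G1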
By summing up the graded scores up to any point in the ranking, we obtain the cumulated gain (CG)
For query q1, for instance, the cumulated gain at the first position is 1, at the second position is 1 + 0, and so on
Thus, the cumulated gain vectors for queries q1 and q2 are given by
  CG1 = (1, 1, 2, 2, 2, 5, 5, 5, 5, 7, 7, 7, 7, 7, 10)
  CG2 = (0, 0, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 6)
For instance, the cumulated gain at position 8 of CG1 is equal to 5

In formal terms: given the gain vector G_j for a test query q_j, the cumulated gain vector CG_j associated with it is defined as

    CG_j[i] = \begin{cases} G_j[1] & \text{if } i = 1 \\ G_j[i] + CG_j[i-1] & \text{otherwise} \end{cases}

where CG_j[i] refers to the cumulated gain at the i-th position of the ranking for query q_j

We also introduce a discount factor that reduces the impact of the gain as we move down the ranking
A simple discount factor is the logarithm of the ranking position
  if we consider logs in base 2, this discount factor will be log2 2 at position 2, log2 3 at position 3, and so on
By dividing each gain by the corresponding discount factor, we obtain the discounted cumulated gain (DCG)

More formally: given the gain vector G_j for a test query q_j, the vector DCG_j associated with it is defined as

    DCG_j[i] = \begin{cases} G_j[1] & \text{if } i = 1 \\ \frac{G_j[i]}{\log_2 i} + DCG_j[i-1] & \text{otherwise} \end{cases}

where DCG_j[i] refers to the discounted cumulated gain at the i-th position of the ranking for query q_j

For the example queries q1 and q2, the DCG vectors are given by
  DCG1 = (1.0, 1.0, 1.6, 1.6, 1.6, 2.8, 2.8, 2.8, 2.8, 3.4, 3.4, 3.4, 3.4, 3.4, 4.2)
  DCG2 = (0.0, 0.0, 1.3, 1.3, 1.3, 1.3, 1.3, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 1.6, 2.4)
Discounted cumulated gains are much less affected by relevant documents at the end of the ranking
By adopting logs in higher bases, the discount factor can be accentuated

DCG Curves

To produce CG and DCG curves over a set of test queries, we need to average them over all queries
Given a set of N_q queries, the average CG[i] and DCG[i] over all queries are computed as follows

    \overline{CG}[i] = \sum_{j=1}^{N_q} \frac{CG_j[i]}{N_q} ; \quad \overline{DCG}[i] = \sum_{j=1}^{N_q} \frac{DCG_j[i]}{N_q}

For instance, for the example queries q1 and q2, these averages are given by
  CG = (0.5, 0.5, 2.0, 2.0, 2.0, 3.5, 3.5, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0)
  DCG = (0.5, 0.5, 1.5, 1.5, 1.5, 2.1, 2.1, 2.2, 2.2, 2.5, 2.5, 2.5, 2.5, 2.5, 3.3)
Then, average curves can be drawn by varying the rank positions from 1 to a pre-established threshold
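The recursive definitions of CG and DCG translate directly into code. The following is a minimal sketch, assuming base-2 logarithms as in the text; helper names are ours.

    import math

    def cumulated_gain(gains):
        """CG[i]: running sum of the graded gains up to each ranking position."""
        cg, total = [], 0
        for g in gains:
            total += g
            cg.append(total)
        return cg

    def discounted_cumulated_gain(gains):
        """DCG[i]: like CG, but gains at positions i >= 2 are divided by log2(i)."""
        dcg, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i == 1 else g / math.log2(i)
            dcg.append(total)
        return dcg

    G1 = [1, 0, 1, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 3]
    print(cumulated_gain(G1))                                    # matches CG1 above
    print([round(x, 1) for x in discounted_cumulated_gain(G1)])  # matches DCG1 above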
Ideal CG and DCG Metrics

Recall and precision figures are computed relative to the set of relevant documents
CG and DCG scores, as defined above, are not computed relative to any baseline
  this implies that it might be confusing to use them directly to compare two distinct retrieval algorithms
One solution to this problem is to define a baseline to be used for normalization
This baseline is given by the ideal CG and DCG metrics, as we now discuss

For a given test query q, assume that the relevance assessments made by the specialists produced:
  n3 documents evaluated with a relevance score of 3
  n2 documents evaluated with a relevance score of 2
  n1 documents evaluated with a score of 1
  n0 documents evaluated with a score of 0
The ideal gain vector IG is created by sorting all relevance scores in decreasing order, as follows:
  IG = (3, ..., 3, 2, ..., 2, 1, ..., 1, 0, ..., 0)
For instance, for the example queries q1 and q2, we have
  IG1 = (3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0)
  IG2 = (3, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)

Ideal CG and ideal DCG vectors can be computed analogously to the computations of CG and DCG
For the example queries q1 and q2, we have
  ICG1 = (3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, 19, 19)
  ICG2 = (3, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6)
The ideal DCG vectors are given by
  IDCG1 = (3.0, 6.0, 7.9, 8.9, 9.8, 10.5, 10.9, 11.2, 11.5, 11.8, 11.8, 11.8, 11.8, 11.8, 11.8)
  IDCG2 = (3.0, 5.0, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6, 5.6)

Further, average ICG and average IDCG scores can be computed as follows

    \overline{ICG}[i] = \sum_{j=1}^{N_q} \frac{ICG_j[i]}{N_q} ; \quad \overline{IDCG}[i] = \sum_{j=1}^{N_q} \frac{IDCG_j[i]}{N_q}

For instance, for the example queries q1 and q2, the average ICG and IDCG vectors are given by
  ICG = (3.0, 5.5, 7.5, 8.5, 9.5, 10.5, 11.0, 11.5, 12.0, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5)
  IDCG = (3.0, 5.5, 6.8, 7.3, 7.7, 8.1, 8.3, 8.4, 8.6, 8.7, 8.7, 8.7, 8.7, 8.7, 8.7)
By comparing the average CG and DCG curves for an algorithm with the average ideal curves, we gain insight into how much room for improvement there is

Normalized DCG

Precision and recall figures can be directly compared to the ideal curve of 100% precision at all recall levels
DCG figures, however, are not built relative to any ideal curve, which makes it difficult to directly compare DCG curves for two distinct ranking algorithms
This can be corrected by normalizing the DCG metric
Given a set of N_q test queries, the normalized CG and DCG metrics are given by

    \overline{NCG}[i] = \frac{\overline{CG}[i]}{\overline{ICG}[i]} ; \quad \overline{NDCG}[i] = \frac{\overline{DCG}[i]}{\overline{IDCG}[i]}
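A minimal sketch of this normalization for a single query (helper names are ours). Note that the ideal vector must be built from the grades of all judged documents for the query, which is why they are passed separately, and that the text averages DCG and IDCG across queries before dividing, whereas this sketch normalizes one query at a time.

    import math

    def dcg(gains):
        """Discounted cumulated gain vector for a list of graded gains."""
        out, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i == 1 else g / math.log2(i)
            out.append(total)
        return out

    def ndcg(gains, all_grades):
        """NDCG[i] = DCG[i] / IDCG[i]; the ideal gain vector sorts the grades of
        all judged documents in decreasing order, padded with zeros."""
        ideal = sorted(all_grades, reverse=True)
        ideal = (ideal + [0] * len(gains))[:len(gains)]
        actual, best = dcg(gains), dcg(ideal)
        return [a / b if b > 0 else 0.0 for a, b in zip(actual, best)]

    G2 = [0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 3]
    grades_q2 = [3, 2, 1]  # grades of all judged relevant documents for q2
    print([round(x, 2) for x in ndcg(G2, grades_q2)])
    # starts [0.0, 0.0, 0.22, ...] and ends at 0.42; these are per-query values,
    # while the text's NDCG curve averages DCG and IDCG over q1 and q2 first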
For instance, for the example queries q1 and q2, the NCG and NDCG vectors are given by
  NCG = (0.17, 0.09, 0.27, 0.24, 0.21, 0.33, 0.32, 0.35, 0.33, 0.40, 0.40, 0.40, 0.40, 0.40, 0.64)
  NDCG = (0.17, 0.09, 0.21, 0.20, 0.19, 0.25, 0.25, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.38)
The area under the NCG and NDCG curves represents the quality of the ranking algorithm
  the higher the area, the better the results are considered to be
Thus, normalized figures can be used to compare two distinct ranking algorithms

Discussion on DCG Metrics

CG and DCG metrics aim at taking into account multiple-level relevance assessments
  this has the advantage of distinguishing highly relevant documents from mildly relevant ones
  the inherent disadvantage is that multiple-level relevance assessments are harder and more time consuming to generate
Despite these inherent difficulties, the CG and DCG metrics present benefits:
  they allow systematically combining document ranks and relevance scores
  cumulated gain provides a single metric of retrieval performance at any position in the ranking
  it also stresses the gain produced by relevant docs up to a position in the ranking, which makes the metrics more immune to outliers
  further, discounted cumulated gain allows down-weighting the impact of relevant documents found late in the ranking

Rank Correlation Metrics

Precision and recall allow comparing the relevance of the results produced by two ranking functions
However, there are situations in which
  we cannot directly measure relevance
  we are more interested in determining how differently a ranking function varies from a second one that we know well
In these cases, we are interested in comparing the relative ordering produced by the two rankings
This can be accomplished by using statistical functions called rank correlation metrics

Consider two rankings R1 and R2
A rank correlation metric yields a correlation coefficient C(R1, R2) with the following properties (illustrated by the sketch below):
  -1 ≤ C(R1, R2) ≤ 1
  if C(R1, R2) = 1, the agreement between the two rankings is perfect, i.e., they are the same
  if C(R1, R2) = -1, the disagreement between the two rankings is perfect, i.e., they are the reverse of each other
  if C(R1, R2) = 0, the two rankings are completely independent
  increasing values of C(R1, R2) imply increasing agreement between the two rankings
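These properties can be checked with off-the-shelf implementations. The sketch below uses SciPy's spearmanr and kendalltau on a toy example of document positions; the specific coefficients themselves are defined in the slides that follow.

    from scipy.stats import spearmanr, kendalltau

    r1 = [1, 2, 3, 4, 5]          # positions of 5 documents in ranking R1
    identical = [1, 2, 3, 4, 5]   # R2 equal to R1
    reverse = [5, 4, 3, 2, 1]     # R2 is the reverse of R1

    rho, _ = spearmanr(r1, identical)
    print(rho)                    # 1.0  -> perfect agreement
    rho, _ = spearmanr(r1, reverse)
    print(rho)                    # -1.0 -> perfect disagreement
    tau, _ = kendalltau(r1, reverse)
    print(tau)                    # -1.0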
The Spearman Coefficient

The Spearman coefficient is likely the most widely used rank correlation metric
It is based on the differences between the positions of the same document in two rankings
Let s1,j be the position of a document d_j in ranking R1 and s2,j be the position of d_j in ranking R2

Consider 10 example documents retrieved by two distinct rankings R1 and R2, with document positions s1,j and s2,j as follows:

  document    s1,j   s2,j   s1,j - s2,j   (s1,j - s2,j)^2
  d123          1      2        -1               1
  d84           2      3        -1               1
  d56           3      1        +2               4
  d6            4      5        -1               1
  d8            5      4        +1               1
  d9            6      7        -1               1
  d511          7      8        -1               1
  d129          8     10        -2               4
  d187          9      6        +3               9
  d25          10      9        +1               1
  Sum of squared differences                    24

By plotting the rank positions for R1 and R2 in a 2-dimensional coordinate system, we observe that there is a strong correlation between the two rankings

To produce a quantitative assessment of this correlation, we sum the squares of the differences for each pair of positions
If there are K documents ranked, the maximum value for the sum of squares of ranking differences is given by

    \frac{K (K^2 - 1)}{3}

Let K = 10
  if the two rankings were in perfect disagreement, this value is (10 × (10^2 - 1)) / 3, or 330
  on the other hand, if we have complete agreement the sum is 0

Let us consider the fraction

    \frac{\sum_{j=1}^{K} (s_{1,j} - s_{2,j})^2}{\frac{K (K^2 - 1)}{3}}

Its value is
  0 when the two rankings are in perfect agreement
  +1 when they are in perfect disagreement
If we multiply the fraction by 2, its value shifts to the range [0, +2]
If we now subtract the result from 1, the resultant value shifts to the range [-1, +1]

This reasoning suggests defining the correlation between the two rankings as follows
Let s1,j and s2,j be the positions of a document d_j in two rankings R1 and R2, respectively
Define

    S(R_1, R_2) = 1 - \frac{6 \sum_{j=1}^{K} (s_{1,j} - s_{2,j})^2}{K (K^2 - 1)}

where
  S(R1, R2) is the Spearman rank correlation coefficient
  K indicates the size of the ranked sets

For the rankings in the table above, we have

    S(R_1, R_2) = 1 - \frac{6 \times 24}{10 \times (10^2 - 1)} = 1 - \frac{144}{990} = 0.854
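The Spearman formula above is straightforward to implement. A minimal sketch using the position vectors from the table; the spearman helper is ours.

    def spearman(positions_1, positions_2):
        """Spearman coefficient: 1 - 6 * (sum of squared position differences) / (K (K^2 - 1))."""
        k = len(positions_1)
        sq = sum((a - b) ** 2 for a, b in zip(positions_1, positions_2))
        return 1 - (6 * sq) / (k * (k ** 2 - 1))

    # Positions of the 10 example documents in rankings R1 and R2 (table above).
    s1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    s2 = [2, 3, 1, 5, 4, 7, 8, 10, 6, 9]

    print(round(spearman(s1, s2), 3))  # 0.855 (0.854 in the text, which truncates)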
The Kendall Tau Coefficient

It is difficult to assign an operational interpretation to the Spearman coefficient
One alternative is to use a coefficient that has a natural and intuitive interpretation, such as the Kendall Tau coefficient

When we think of rank correlations, we think of how two rankings tend to vary in similar ways
To illustrate, consider two documents d_j and d_k and their positions in the rankings R1 and R2
Further, consider the differences in rank positions for these two documents in each ranking, i.e.,
  s1,k - s1,j
  s2,k - s2,j
If these differences have the same sign, we say that the document pair [d_k, d_j] is concordant in both rankings
If they have different signs, we say that the document pair is discordant in the two rankings

Consider the top 5 documents in rankings R1 and R2:

  document    s1,j   s2,j   s1,j - s2,j
  d123          1      2        -1
  d84           2      3        -1
  d56           3      1        +2
  d6            4      5        -1
  d8            5      4        +1

The ordered document pairs in ranking R1 are
  [d123, d84], [d123, d56], [d123, d6], [d123, d8], [d84, d56],
  [d84, d6], [d84, d8], [d56, d6], [d56, d8], [d6, d8]
for a total of (1/2) × 5 × 4, or 10 ordered pairs
Repeating the same exercise for the top 5 documents in ranking R2, we obtain
  [d56, d123], [d56, d84], [d56, d8], [d56, d6], [d123, d84],
  [d123, d8], [d123, d6], [d84, d8], [d84, d6], [d8, d6]
We compare these two sets of ordered pairs looking for concordant and discordant pairs

Let us mark with a C the concordant pairs and with a D the discordant pairs
  for ranking R1, we have: C, D, C, C, D, C, C, C, C, D
  for ranking R2, we have: D, D, C, C, C, C, C, C, C, D
That is, a total of 20, i.e., K(K - 1), ordered pairs are produced jointly by the two rankings
Among these, 14 pairs are concordant and 6 pairs are discordant

The Kendall Tau coefficient is defined as the probability that a pair is concordant minus the probability that it is discordant:

    \tau(R_1, R_2) = P(\text{concordant}) - P(\text{discordant})

In our example

    \tau(R_1, R_2) = \frac{14}{20} - \frac{6}{20} = 0.4

Let
  Δ(R1, R2): the number of discordant document pairs in R1 and R2
  K(K - 1) - Δ(R1, R2): the number of concordant document pairs in R1 and R2
Then,

    P(\text{concordant}) = \frac{K(K - 1) - \Delta(R_1, R_2)}{K(K - 1)} ; \quad P(\text{discordant}) = \frac{\Delta(R_1, R_2)}{K(K - 1)}

which yields

    \tau(R_1, R_2) = 1 - \frac{2 \Delta(R_1, R_2)}{K(K - 1)}

For the case of our previous example, we have
  Δ(R1, R2) = 6
  K = 5
Thus,

    \tau(R_1, R_2) = 1 - \frac{2 \times 6}{5 \times (5 - 1)} = 0.4

as before
The Kendall Tau coefficient is defined only for rankings over the same set of elements
Most important, it has a simpler algebraic structure than the Spearman coefficient
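A minimal sketch of the Kendall Tau computation under the ordered-pair counting used above. The kendall_tau helper is ours; it counts unordered discordant pairs and doubles them to match the K(K - 1) ordered pairs of the text.

    from itertools import combinations

    def kendall_tau(positions_1, positions_2):
        """tau = 1 - 2 * Delta / (K (K - 1)), where Delta counts discordant ordered pairs."""
        k = len(positions_1)
        discordant = 0
        for i, j in combinations(range(k), 2):
            d1 = positions_1[i] - positions_1[j]
            d2 = positions_2[i] - positions_2[j]
            if d1 * d2 < 0:   # opposite signs -> the pair is discordant
                discordant += 1
        # Each unordered discordant pair appears twice among the K(K-1) ordered pairs.
        delta = 2 * discordant
        return 1 - (2 * delta) / (k * (k - 1))

    # Top 5 documents of the example: positions in R1 and R2.
    s1 = [1, 2, 3, 4, 5]
    s2 = [2, 3, 1, 5, 4]

    print(kendall_tau(s1, s2))  # 0.4, as in the text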
Side-by-Side Panels

A form of evaluating two different systems is to evaluate their results side by side
Typically, the top 10 results produced by the systems for a given query are displayed in side-by-side panels
Presenting the results side by side allows controlling:
  differences of opinion among subjects
  influences on the user's opinion produced by the ordering of the top results

As an example, consider side-by-side panels for Yahoo! and Google showing the top 5 answers produced by each search engine for the query "information retrieval evaluation"
The side-by-side experiment is simply a judgement on which side provides better results for a given query
By recording the interactions of the users, we can infer which of the answer sets is preferred for the query
Side-by-side panels can be used for quick comparison of distinct search engines

A/B Testing & Crowdsourcing

There are a number of limitations with current approaches for relevance evaluation
  for instance, the Cranfield paradigm is expensive and has obvious scalability issues
Recently, crowdsourcing has emerged as a feasible alternative for relevance evaluation
Crowdsourcing is a term used to describe tasks that are outsourced to a large group of people, called "workers"
It is an open call to solve a problem or carry out a task, one which usually involves a monetary reward in exchange for the service

To illustrate, crowdsourcing has been used to validate research on the quality of search snippets
One of the most important aspects of crowdsourcing is to design the experiment carefully
  it is important to ask the right questions and to use well-known usability techniques
  workers are not information retrieval experts, so the task designer should provide clear instructions