1. Evaluation
   Rong Jin

2. Evaluation
   - Evaluation is key to building effective and efficient search engines
     - usually carried out in controlled experiments
     - online testing can also be done
   - Effectiveness and efficiency are related
     - high efficiency may be obtained at the price of effectiveness

3. Evaluation Corpus
   - Test collections consist of documents, queries, and relevance judgments
   (figure: example test collections)

4. Test Collections
   (figures: statistics of the test collections)

5. TREC Topic Example
   (figure: an example TREC topic)

6. Relevance Judgments
   - Obtaining relevance judgments is an expensive, time-consuming process
     - who does it?
     - what are the instructions?
     - what is the level of agreement?
   - TREC judgments
     - depend on the task being evaluated
     - generally binary
     - reasonable agreement because of the "narrative"

7. Pooling
   - Exhaustive judgments for all documents in a collection are not practical
   - The pooling technique is used in TREC
     - the top k results (k varied between 50 and 200) from the rankings obtained by different search engines are merged into a pool
     - duplicates are removed
     - documents are presented in some random order to the relevance judges
   - Produces a large number of relevance judgments for each query, although still incomplete

8. Pooling (example)
   For one query over a collection of 1,000,000 documents:
     - search engine 1 returns d1, d2, d100, d75
     - search engine 2 returns d43, d2, d501, d75
     - search engine 3 returns d1, d13, d100, d75
   The merged pool to be judged is d2, d43, d13, d75, d1, d100, d501.

9. Bias in Relevance Judgments
   - Relevance judgment is subjective
   - Assessors disagree with each other

10. Combining Multiple Judgments
   - Judges disagree a lot; how can judgments from multiple reviewers be combined?
     - union
     - intersection
     - majority vote

11. Combining Multiple Judgments
   - The choice has a large impact on absolute performance numbers
   - It has virtually no impact on the ranking of systems

12. Query Logs
   - Used for tuning and evaluating search engines
     - also for techniques such as query suggestion and spell checking
   - Typical contents
     - user identifier or user session identifier
     - query terms, stored exactly as the user entered them
     - list of URLs of results, their ranks on the result list, and whether they were clicked on
     - timestamp(s) recording the time of user events such as query submissions and clicks

13. Query Logs
   - Clicks are not relevance judgments
     - although they are correlated
     - they are biased by a number of factors, such as rank on the result list
   - Clickthrough data can be used to predict preferences between pairs of documents
     - appropriate for tasks with multiple levels of relevance, focused on user relevance
     - various "policies" are used to generate preferences

14. Example Click Policy
   - Skip Above and Skip Next
     (figures: click data and the preferences generated from it)

15. Query Logs
   - Click data can also be aggregated to remove noise
   - Click distribution information
     - can be used to identify clicks that have a higher frequency than would be expected
     - such clicks have a high correlation with relevance
     - e.g., using click deviation to filter clicks for preference-generation policies

16. Evaluation Metrics: Classification View
   Each document falls into one of four categories, depending on whether it was retrieved and whether it is relevant:

                    Relevant               Not relevant
   Retrieved        relevant retrieved     irrelevant retrieved
   Not retrieved    relevant rejected      irrelevant rejected

17. Evaluation Metrics: Example
   (figure: a ranking with 3 retrieved documents, 2 of them relevant, out of 6 relevant documents in the collection)
   Recall = 2/6 = 0.33
   Precision = 2/3 = 0.67

18. Evaluation Metrics: Example
   (figure: a ranking with 6 retrieved documents, 5 of them relevant, out of 6 relevant documents in the collection)
   Recall = 5/6 = 0.83
   Precision = 5/6 = 0.83

19. F Measure
   - The harmonic mean of recall and precision:
     F = 2 * Recall * Precision / (Recall + Precision)
   - Why the harmonic mean?
     - the harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large

20. Evaluation Metrics: Example
   Recall = 2/6 = 0.33
   Precision = 2/3 = 0.67
   F = 2 * Recall * Precision / (Recall + Precision) = 2 * 0.33 * 0.67 / (0.33 + 0.67) = 0.44

21. Evaluation Metrics: Example
   Recall = 5/6 = 0.83
   Precision = 5/6 = 0.83
   F = 2 * Recall * Precision / (Recall + Precision) = 2 * 0.83 * 0.83 / (0.83 + 0.83) = 0.83
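To make the set-based metrics on slides 16 through 21 concrete, here is a minimal Python sketch of precision, recall, and the F measure. The function name and the document identifiers are illustrative placeholders, not part of the slides; only the formulas follow the deck.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and balanced F measure for one query.

    retrieved -- iterable of document ids returned by the engine
    relevant  -- iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)                       # relevant retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of recall and precision: F = 2RP / (R + P).
    f = 2 * recall * precision / (recall + precision) if hits else 0.0
    return precision, recall, f

# Hypothetical ids matching the numbers on slide 18:
# 6 relevant documents, 6 retrieved, 5 of them relevant.
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
retrieved = ["d1", "d2", "d3", "d4", "d5", "d99"]
print(precision_recall_f(retrieved, relevant))   # -> (0.833..., 0.833..., 0.833...)
```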
22. Evaluation for Ranking
   - Average precision
     - average the precision values from the rank positions where a relevant document was retrieved
     - set the precision value to zero for relevant documents that are not retrieved

23-27. Average Precision: Examples
   (figures: worked average-precision calculations for example rankings; the later rankings miss one and then two relevant documents)

28. Mean Average Precision (MAP)
   (figure: worked MAP example)

29. Mean Average Precision (MAP)
   - Summarizes rankings from multiple queries by averaging average precision
   - The most commonly used measure in research papers
   - Assumes the user is interested in finding many relevant documents for each query
   - Requires many relevance judgments in the test collection

30. Recall-Precision Graph
   (figures: recall-precision graphs for the example rankings)
   Note: there can be multiple precision values at some recall levels.

31. Interpolation
   - P(R) = max { P' : R' >= R and (R', P') in S }, where S is the set of observed (R, P) points
   - Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
     - produces a step function
     - defines precision at recall 0.0

32-38. Interpolation (worked example)
   The interpolated precision is filled in one standard recall level at a time; the completed table is:

   Recall                   0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
   Interpolated precision   1.0   1.0   1.0   0.67  0.67  0.5   0.5   0.5   0.5   0.5   0.5

39. Interpolation
   (figure: the interpolated, step-function recall-precision graph alongside the observed points)

40. Average Precision at Standard Recall Levels
   - Only consider the standard recall levels, varying from 0.0 to 1.0 in increments of 0.1
   - The recall-precision graph is plotted by simply joining the average precision points at the standard recall levels

41. Average Recall-Precision Graph
   (figure: recall-precision graph averaged over queries at the standard recall levels)

42. Graph for 50 Queries
   (figure: average recall-precision graph over 50 queries)
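A minimal sketch of the average precision, MAP, and interpolated-precision calculations described on slides 22 through 40, assuming each query is represented by a ranked list of document ids plus a set of judged-relevant ids. The function names and the example data are hypothetical; only the definitions come from the slides.

```python
def average_precision(ranking, relevant):
    """ranking: doc ids in ranked order; relevant: set of judged-relevant ids.
    Precision is taken at each rank where a relevant document occurs;
    relevant documents that are never retrieved contribute zero."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at the standard recall levels 0.0, 0.1, ..., 1.0:
    the maximum precision observed at any recall >= the level (a step function)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    hits = 0
    points = []                                 # observed (recall, precision) points
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Hypothetical single-query example: relevant docs appear at ranks 1, 2, and 5.
ranking = ["d3", "d7", "d12", "d4", "d9"]
relevant = {"d3", "d7", "d9", "d21"}            # d21 is never retrieved
print(average_precision(ranking, relevant))     # (1/1 + 2/2 + 3/5) / 4 = 0.65
```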
43. Focusing on Top Documents
   - Users tend to look at only the top part of the ranked result list to find relevant documents
   - Some search tasks have only one relevant document
     - e.g., navigational search, question answering
   - Recall is not appropriate here
     - instead we need to measure how well the search engine does at retrieving relevant documents at very high ranks

44. Focusing on Top Documents
   - Precision at rank R
     - R is typically 5, 10, or 20
     - easy to compute, average, and understand
     - not sensitive to rank positions less than R
   - Reciprocal rank
     - the reciprocal of the rank at which the first relevant document is retrieved
     - Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
     - very sensitive to rank position

45. MRR
   (figure: two example rankings)
   RR = 1/1 = 1 for the first ranking and RR = 1/2 = 0.5 for the second,
   so MRR = (1 + 0.5) / 2 = 0.75

46. Discounted Cumulative Gain (DCG)
   - Popular measure for evaluating web search and related tasks
     - uses graded relevance judgments
   - Two assumptions:
     - highly relevant documents are more useful than marginally relevant documents
     - the lower the ranked position of a relevant document, the less useful it is to the user, since it is less likely to be examined

47. Discounted Cumulative Gain
   - Gain is accumulated starting at the top of the ranking and is discounted at lower ranks
     - the typical discount is 1/log(rank)
     - with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
   - DCG at rank p: DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)

48. DCG Example
   - 10 ranked documents judged on a 0-3 relevance scale:
     3, 2, 3, 0, 0, 1, 2, 2, 3, 0
   - discounted gain:
     3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
     = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
   - DCG accumulated at each rank:
     3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61

49. Efficiency Metrics
   - Query throughput
     - the number of queries processed per second
   - Query latency
     - the time between issuing a query and receiving the response, measured in milliseconds
     - users perceive a response as instantaneous when latency is below about 150 milliseconds
   - Relation between throughput and latency
     - raising throughput means handling more queries simultaneously, which tends to increase latency

50. Significance Tests
   - Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
   - A significance test
     - null hypothesis: there is no difference between A and B
     - alternative hypothesis: B is better than A
     - the power of a test is the probability that it will correctly reject the null hypothesis
     - increasing the number of queries in the experiment also increases the power of the test

51. Example Experimental Results
   (table: per-query effectiveness of A and B)
   Significance level: alpha = 0.05, the threshold on the probability of the observed results arising when B = A.

52. Example Experimental Results
   Average effectiveness: A = 41.1, B = 62.5
   Paired t-test: t = 2.33, p-value = 0.02
   The probability of these results arising when B = A is 0.02, which is below alpha = 0.05, so we conclude that B is better than A.

53. Online Testing
   - Test (or even train) using live traffic on a search engine
   - Benefits: real users, less bias, large amounts of test data
   - Drawbacks: noisy data, can degrade the user experience
   - Often done on a small proportion (1-5%) of live traffic

54. Summary
   - No single measure is the correct one for every application
     - choose measures appropriate for the task
     - use a combination
     - different measures reveal different aspects of system effectiveness
   - Use significance tests (e.g., the t-test)
   - Analyze the performance of individual queries
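Finally, a minimal sketch of the DCG computation from slides 47 and 48, assuming the 1/log2(rank) discount with no discount at rank 1; the function name is illustrative. Run on the graded judgments from slide 48, it reproduces the final DCG value shown there.

```python
import math

def dcg(relevance_grades):
    """Discounted cumulative gain for graded judgments listed in ranked order.
    Rank 1 is not discounted; rank i > 1 is discounted by 1/log2(i)."""
    total = 0.0
    for rank, rel in enumerate(relevance_grades, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total

# The ten graded judgments from the DCG example on slide 48.
grades = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(grades), 2))   # -> 9.61, matching the last cumulative value on the slide
```

For the paired significance test on slides 50 through 52, the per-query scores of A and B can be compared with a paired t-test, for example scipy.stats.ttest_rel, which returns the t statistic and the p-value.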