1. Evaluation
   Rong Jin

2. Evaluation
   - Evaluation is key to building effective and efficient search engines
     - usually carried out in controlled experiments
     - online testing can also be done
   - Effectiveness and efficiency are related
     - high efficiency may be obtained at the price of effectiveness

3. Evaluation Corpus
   - Test collections consist of documents, queries, and relevance judgments
   (figure: example test collections)

4. Test Collections
   (figures: statistics of the test collections)

5. TREC Topic Example
   (figure: an example TREC topic)

6. Relevance Judgments
   - Obtaining relevance judgments is an expensive, time-consuming process
     - who does it?
     - what are the instructions?
     - what is the level of agreement?
   - TREC judgments
     - depend on the task being evaluated
     - generally binary
     - reasonable agreement because of the "narrative"

7. Pooling
   - Exhaustive judgments for all documents in a collection are not practical
   - The pooling technique is used in TREC
     - the top k results (k varied between 50 and 200) from the rankings obtained by different search engines are merged into a pool
     - duplicates are removed
     - documents are presented in some random order to the relevance judges
   - Produces a large number of relevance judgments for each query, although still incomplete

8. Pooling (example)
   For one query over a collection of 1,000,000 documents:
     - search engine 1 returns d1, d2, d100, d75
     - search engine 2 returns d43, d2, d501, d75
     - search engine 3 returns d1, d13, d100, d75
   The merged pool to be judged is d2, d43, d13, d75, d1, d100, d501.

9. Bias in Relevance Judgments
   - Relevance judgment is subjective
   - Assessors disagree with each other

10. Combining Multiple Judgments
   - Judges disagree a lot; how can judgments from multiple reviewers be combined?
     - union
     - intersection
     - majority vote

11. Combining Multiple Judgments
   - The choice has a large impact on absolute performance numbers
   - It has virtually no impact on the ranking of systems

12. Query Logs
   - Used for tuning and evaluating search engines
     - also for techniques such as query suggestion and spell checking
   - Typical contents
     - user identifier or user session identifier
     - query terms, stored exactly as the user entered them
     - list of URLs of results, their ranks on the result list, and whether they were clicked on
     - timestamp(s) recording the time of user events such as query submissions and clicks

13. Query Logs
   - Clicks are not relevance judgments
     - although they are correlated
     - they are biased by a number of factors, such as rank on the result list
   - Clickthrough data can be used to predict preferences between pairs of documents
     - appropriate for tasks with multiple levels of relevance, focused on user relevance
     - various "policies" are used to generate preferences

14. Example Click Policy
   - Skip Above and Skip Next
     (figures: click data and the preferences generated from it)

15. Query Logs
   - Click data can also be aggregated to remove noise
   - Click distribution information
     - can be used to identify clicks that have a higher frequency than would be expected
     - such clicks have a high correlation with relevance
     - e.g., using click deviation to filter clicks for preference-generation policies

16. Evaluation Metrics: Classification View
   Each document falls into one of four categories, depending on whether it was retrieved and whether it is relevant:

                    Relevant               Not relevant
   Retrieved        relevant retrieved     irrelevant retrieved
   Not retrieved    relevant rejected      irrelevant rejected

17. Evaluation Metrics: Example
   (figure: a ranking with 3 retrieved documents, 2 of them relevant, out of 6 relevant documents in the collection)
   Recall = 2/6 = 0.33
   Precision = 2/3 = 0.67

18. Evaluation Metrics: Example
   (figure: a ranking with 6 retrieved documents, 5 of them relevant, out of 6 relevant documents in the collection)
   Recall = 5/6 = 0.83
   Precision = 5/6 = 0.83

19. F Measure
   - The harmonic mean of recall and precision:
     F = 2 * Recall * Precision / (Recall + Precision)
   - Why the harmonic mean?
     - the harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large

20. Evaluation Metrics: Example
   Recall = 2/6 = 0.33
   Precision = 2/3 = 0.67
   F = 2 * Recall * Precision / (Recall + Precision) = 2 * 0.33 * 0.67 / (0.33 + 0.67) = 0.44

21. Evaluation Metrics: Example
   Recall = 5/6 = 0.83
   Precision = 5/6 = 0.83
   F = 2 * Recall * Precision / (Recall + Precision) = 2 * 0.83 * 0.83 / (0.83 + 0.83) = 0.83
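To make the set-based metrics on slides 16 through 21 concrete, here is a minimal Python sketch of precision, recall, and the F measure. The function name and the document identifiers are illustrative placeholders, not part of the slides; only the formulas follow the deck.

```python
def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and balanced F measure for one query.

    retrieved -- iterable of document ids returned by the engine
    relevant  -- iterable of document ids judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)                       # relevant retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # Harmonic mean of recall and precision: F = 2RP / (R + P).
    f = 2 * recall * precision / (recall + precision) if hits else 0.0
    return precision, recall, f

# Hypothetical ids matching the numbers on slide 18:
# 6 relevant documents, 6 retrieved, 5 of them relevant.
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
retrieved = ["d1", "d2", "d3", "d4", "d5", "d99"]
print(precision_recall_f(retrieved, relevant))   # -> (0.833..., 0.833..., 0.833...)
```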
22. Evaluation for Ranking
   - Average precision
     - average the precision values from the rank positions where a relevant document was retrieved
     - set the precision value to zero for relevant documents that are not retrieved

23-27. Average Precision: Examples
   (figures: worked average-precision calculations for example rankings; the later rankings miss one and then two relevant documents)

28. Mean Average Precision (MAP)
   (figure: worked MAP example)

29. Mean Average Precision (MAP)
   - Summarizes rankings from multiple queries by averaging average precision
   - The most commonly used measure in research papers
   - Assumes the user is interested in finding many relevant documents for each query
   - Requires many relevance judgments in the test collection

30. Recall-Precision Graph
   (figures: recall-precision graphs for the example rankings)
   Note: there can be multiple precision values at some recall levels.

31. Interpolation
   - P(R) = max { P' : R' >= R and (R', P') in S }, where S is the set of observed (R, P) points
   - Defines precision at any recall level as the maximum precision observed in any recall-precision point at a higher recall level
     - produces a step function
     - defines precision at recall 0.0

32-38. Interpolation (worked example)
   The interpolated precision is filled in one standard recall level at a time; the completed table is:

   Recall                   0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
   Interpolated precision   1.0   1.0   1.0   0.67  0.67  0.5   0.5   0.5   0.5   0.5   0.5

39. Interpolation
   (figure: the interpolated, step-function recall-precision graph alongside the observed points)

40. Average Precision at Standard Recall Levels
   - Only consider the standard recall levels, varying from 0.0 to 1.0 in increments of 0.1
   - The recall-precision graph is plotted by simply joining the average precision points at the standard recall levels

41. Average Recall-Precision Graph
   (figure: recall-precision graph averaged over queries at the standard recall levels)

42. Graph for 50 Queries
   (figure: average recall-precision graph over 50 queries)
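A minimal sketch of the average precision, MAP, and interpolated-precision calculations described on slides 22 through 40, assuming each query is represented by a ranked list of document ids plus a set of judged-relevant ids. The function names and the example data are hypothetical; only the definitions come from the slides.

```python
def average_precision(ranking, relevant):
    """ranking: doc ids in ranked order; relevant: set of judged-relevant ids.
    Precision is taken at each rank where a relevant document occurs;
    relevant documents that are never retrieved contribute zero."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def interpolated_precision(ranking, relevant, levels=None):
    """Interpolated precision at the standard recall levels 0.0, 0.1, ..., 1.0:
    the maximum precision observed at any recall >= the level (a step function)."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    hits = 0
    points = []                                 # observed (recall, precision) points
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Hypothetical single-query example: relevant docs appear at ranks 1, 2, and 5.
ranking = ["d3", "d7", "d12", "d4", "d9"]
relevant = {"d3", "d7", "d9", "d21"}            # d21 is never retrieved
print(average_precision(ranking, relevant))     # (1/1 + 2/2 + 3/5) / 4 = 0.65
```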
43. Focusing on Top Documents
   - Users tend to look at only the top part of the ranked result list to find relevant documents
   - Some search tasks have only one relevant document
     - e.g., navigational search, question answering
   - Recall is not appropriate here
     - instead we need to measure how well the search engine does at retrieving relevant documents at very high ranks

44. Focusing on Top Documents
   - Precision at rank R
     - R is typically 5, 10, or 20
     - easy to compute, average, and understand
     - not sensitive to rank positions less than R
   - Reciprocal rank
     - the reciprocal of the rank at which the first relevant document is retrieved
     - Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries
     - very sensitive to rank position

45. MRR
   (figure: two example rankings)
   RR = 1/1 = 1 for the first ranking and RR = 1/2 = 0.5 for the second,
   so MRR = (1 + 0.5) / 2 = 0.75

46. Discounted Cumulative Gain (DCG)
   - Popular measure for evaluating web search and related tasks
     - uses graded relevance judgments
   - Two assumptions:
     - highly relevant documents are more useful than marginally relevant documents
     - the lower the ranked position of a relevant document, the less useful it is to the user, since it is less likely to be examined

47. Discounted Cumulative Gain
   - Gain is accumulated starting at the top of the ranking and is discounted at lower ranks
     - the typical discount is 1/log(rank)
     - with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
   - DCG at rank p: DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)

48. DCG Example
   - 10 ranked documents judged on a 0-3 relevance scale:
     3, 2, 3, 0, 0, 1, 2, 2, 3, 0
   - discounted gain:
     3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
     = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
   - DCG accumulated at each rank:
     3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61

49. Efficiency Metrics
   - Query throughput
     - the number of queries processed per second
   - Query latency
     - the time between issuing a query and receiving the response, measured in milliseconds
     - users perceive a response as instantaneous when latency is below about 150 milliseconds
   - Relation between throughput and latency
     - raising throughput means handling more queries simultaneously, which tends to increase latency

50. Significance Tests
   - Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
   - A significance test
     - null hypothesis: there is no difference between A and B
     - alternative hypothesis: B is better than A
     - the power of a test is the probability that it will correctly reject the null hypothesis
     - increasing the number of queries in the experiment also increases the power of the test

51. Example Experimental Results
   (table: per-query effectiveness of A and B)
   Significance level: alpha = 0.05, the threshold on the probability of the observed results arising when B = A.

52. Example Experimental Results
   Average effectiveness: A = 41.1, B = 62.5
   Paired t-test: t = 2.33, p-value = 0.02
   The probability of these results arising when B = A is 0.02, which is below alpha = 0.05, so we conclude that B is better than A.

53. Online Testing
   - Test (or even train) using live traffic on a search engine
   - Benefits: real users, less bias, large amounts of test data
   - Drawbacks: noisy data, can degrade the user experience
   - Often done on a small proportion (1-5%) of live traffic

54. Summary
   - No single measure is the correct one for every application
     - choose measures appropriate for the task
     - use a combination
     - different measures reveal different aspects of system effectiveness
   - Use significance tests (e.g., the t-test)
   - Analyze the performance of individual queries
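Finally, a minimal sketch of the DCG computation from slides 47 and 48, assuming the 1/log2(rank) discount with no discount at rank 1; the function name is illustrative. Run on the graded judgments from slide 48, it reproduces the final DCG value shown there.

```python
import math

def dcg(relevance_grades):
    """Discounted cumulative gain for graded judgments listed in ranked order.
    Rank 1 is not discounted; rank i > 1 is discounted by 1/log2(i)."""
    total = 0.0
    for rank, rel in enumerate(relevance_grades, start=1):
        total += rel if rank == 1 else rel / math.log2(rank)
    return total

# The ten graded judgments from the DCG example on slide 48.
grades = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(round(dcg(grades), 2))   # -> 9.61, matching the last cumulative value on the slide
```

For the paired significance test on slides 50 through 52, the per-query scores of A and B can be compared with a paired t-test, for example scipy.stats.ttest_rel, which returns the t statistic and the p-value.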