Model evaluation
- qualitative – following the definition of data mining (Piatetsky-Shapiro, Fayyad, 1990s): how new, interesting, useful and understandable the model is; whether it does (not) correspond to expectations (common sense) and to the knowledge of a domain expert
- quantitative

Evaluation for different machine learning tasks
- clustering – are the number of clusters and their structure appropriate?
- associations – which rules are interesting?
- outlier detection – the top N outliers
- classification and regression

Classification
- Training set → learning algorithm → Model/Hypothesis/Classifier
- input attributes of a test instance → Model → predicted class label
- accuracy – how often the model returns the correct class label
- speed – of learning and of testing
- robustness – the ability to make correct predictions given noisy data or data with missing values
- scalability – efficiency on large amounts of data
- comprehensibility – how well the model can be explained

Classification – estimating accuracy
- main criterion – how successful the model is on data
- a principal decision – which data to use for the most accurate estimate of model accuracy
- most common (but correct?): the learning data itself, a separate test set, cross-validation, leave-one-out
- is there any other, possibly better, option? bootstrapping, splitting the data into disjoint parts, ... (a small code sketch comparing these estimates follows below)
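To make the comparison of these accuracy-estimation schemes concrete, here is a minimal sketch assuming scikit-learn is available; the iris data set and the decision tree classifier are illustrative placeholders, not part of the original slides.

```python
# Sketch: estimating model accuracy with holdout, cross-validation and leave-one-out.
# Assumes scikit-learn; the iris data and the decision tree are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: learn on 2/3 of the data, test on the remaining 1/3 (stratified split).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, stratify=y, random_state=0)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 10-fold cross-validation: average accuracy over the 10 held-out folds.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()

# Leave-one-out: n-fold cross-validation with a single test instance per fold.
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"holdout {holdout_acc:.3f}  10-fold CV {cv_acc:.3f}  leave-one-out {loo_acc:.3f}")
```

The three numbers are estimates of the same quantity; how much they disagree is itself an argument for the resampling-based methods discussed next.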
Confusion matrix
- TP, TN, FP, FN ... the numbers of true positives, true negatives, false positives and false negatives
- P, N ... the numbers (cardinalities) of positive and negative samples

Evaluation measures (overall)
- accuracy: Acc = (TP + TN) / (TP + TN + FP + FN)
- error rate (misclassification rate): Err = 1 − Acc = (w_FP·FP + w_FN·FN) / (TP + TN + FP + FN), where w_FP, w_FN are the weights of FP and FN errors; by default w_FP = w_FN = 1
- precision: TP / (TP + FP)
- sensitivity (true positive rate, recall): TP / (TP + FN)
- specificity (true negative rate): TN / (TN + FP)

Evaluation measures (per class)
- accuracy for a single class (with its own P and N counts)
- F-measures combine precision and recall
- F, F1, F-score – the harmonic mean of precision and recall: F1 = 2·precision·recall / (precision + recall)
- Fβ = (1 + β²)·precision·recall / (β²·precision + recall), where β is a non-negative real number

Evaluation measures for regression trees

Comparing different settings of classifiers
- most common: learning curve, ROC curve, recall–precision curve
- learning curve – performance as a function of the number of iterations, e.g. X = the number of learning instances, Y = accuracy

ROC curve
- the relation between TP and FP for models that return, in addition to a prediction, also a weight (probability)
- declare x_n to be positive if P(y = 1 | x_n) > Θ, otherwise declare it negative (y = 0)
- the numbers of TPs and FPs depend on the threshold Θ; as we change Θ, we obtain different (TPR, FPR) points

Comparing different classifiers

Cross-validation
The following table gives a possible result of evaluating three learning algorithms on a data set with 10-fold cross-validation:

Fold   Naive Bayes   Decision tree   Nearest neighbour
1      0.6809        0.7524          0.7164
2      0.7017        0.8964          0.8883
3      0.7012        0.6803          0.8410
4      0.6913        0.9102          0.6825
5      0.6333        0.7758          0.7599
6      0.6415        0.8154          0.8479
7      0.7216        0.6224          0.7012
8      0.7214        0.7585          0.4959
9      0.6578        0.9380          0.9279
10     0.7865        0.7524          0.7455
avg    0.6937        0.7902          0.7606
stdev  0.0448        0.1014          0.1248

The last two lines give the average and standard deviation over all ten folds. Clearly the decision tree achieves the best result, but should we completely discard nearest neighbour?

Significance testing in cross-validation: the paired t-test
- for a pair of algorithms we calculate the difference in accuracy on each fold; this difference is normally distributed if the two accuracies are
- null hypothesis: the true difference is 0, so that any difference in performance is attributed to chance
- we calculate a p-value (in the simplest case using the normal distribution) and reject the null hypothesis if it is below our significance level α
- we do not have access to the true standard deviation of the differences, so it has to be estimated; the distribution used with this estimate is referred to as the t-distribution
- the extent to which the t-distribution is more heavy-tailed than the normal distribution is governed by the number of degrees of freedom: in our case this is one less than the number of folds (since the final fold's difference is completely determined by the others)
(A SciPy sketch of this test follows below.)
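As a concrete illustration of the paired t-test, the sketch below (assuming NumPy and SciPy are available) recomputes the naive Bayes vs. decision tree comparison from the cross-validation table above, both by hand and via scipy.stats.ttest_rel.

```python
# Sketch: paired t-test on per-fold accuracies (naive Bayes vs. decision tree).
# Assumes NumPy and SciPy; the numbers are the per-fold accuracies from the table above.
import numpy as np
from scipy import stats

nb = np.array([0.6809, 0.7017, 0.7012, 0.6913, 0.6333,
               0.6415, 0.7216, 0.7214, 0.6578, 0.7865])
dt = np.array([0.7524, 0.8964, 0.6803, 0.9102, 0.7758,
               0.8154, 0.6224, 0.7585, 0.9380, 0.7524])

# Manual version: t = mean(d) / (sd(d) / sqrt(k)), with k - 1 degrees of freedom.
d = nb - dt
k = len(d)
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(k))
p_manual = 2 * stats.t.sf(abs(t_stat), df=k - 1)

# Library version: ttest_rel performs the same paired test.
res = stats.ttest_rel(nb, dt)

print(f"manual: t = {t_stat:.3f}, p = {p_manual:.4f}")   # p should come out near the 0.0369 on the slide
print(f"scipy : t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```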
Paired t-test
The numbers show pairwise differences in each fold. The null hypothesis in each case is that the differences come from a normal distribution with mean 0 and unknown standard deviation.

Fold     NB−DT     NB−NN     DT−NN
1       -0.0715   -0.0355    0.0361
2       -0.1947   -0.1866    0.0081
3        0.0209   -0.1398   -0.1607
4       -0.2189    0.0088    0.2277
5       -0.1424   -0.1265    0.0159
6       -0.1739   -0.2065   -0.0325
7        0.0992    0.0204   -0.0788
8       -0.0371    0.2255    0.2626
9       -0.2802   -0.2700    0.0102
10       0.0341    0.0410    0.0069
avg     -0.0965   -0.0669    0.0295
stdev    0.1246    0.1473    0.1278
p-value  0.0369    0.1848    0.4833

The p-value in the last line of the table is calculated by means of the t-distribution with k − 1 = 9 degrees of freedom; only the difference between the naive Bayes and decision tree algorithms is found significant at α = 0.05.

Interpretation of results over multiple data sets
- the t-test is not appropriate for multiple data sets, because performance measures cannot be compared across data sets
- to compare two learning algorithms over multiple data sets we need a test designed specifically for that purpose, such as Wilcoxon's signed-rank test
- the idea is to rank the performance differences in absolute value, from smallest (rank 1) to largest (rank n); we then calculate the sums of ranks for positive and negative differences separately and take the smaller of the two sums as our test statistic
- for a large number of data sets (at least 25) this statistic can be converted into one that is approximately normally distributed; for smaller numbers the critical value (the value of the statistic at which the p-value equals α) can be found in a statistical table

Interpretation of results over multiple data sets II
- the Wilcoxon signed-rank test assumes only that larger performance differences are better than smaller ones; performance differences are treated as ordinals rather than real values
- there is no normality assumption on the distribution of these differences, which means, among other things, that the test is less sensitive to outliers
- in statistical terminology the test is 'non-parametric', as opposed to a parametric test such as the t-test, which assumes a particular distribution; parametric tests are generally more powerful when the assumed distribution is appropriate, but can be misleading when it is not

Wilcoxon's signed-rank test

Data set   NB−DT    Rank
1         -0.0715    4
2         -0.1947    8
3          0.0209    1
4         -0.2189    9
5         -0.1424    6
6         -0.1739    7
7          0.0992    5
8         -0.0371    3
9         -0.2802   10
10         0.0341    2

The sum of ranks for positive differences is 1 + 5 + 2 = 8 and for negative differences 4 + 8 + 9 + 6 + 7 + 3 + 10 = 47. The critical value for 10 data sets at the α = 0.05 level is 8: if the smaller of the two rank sums is less than or equal to 8, the null hypothesis that the ranks are distributed the same for positive and negative differences can be rejected. This applies in this case. (A SciPy sketch of this test follows below.)
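The same NB−DT differences can be fed to SciPy's implementation of the test; a minimal sketch, assuming scipy.stats.wilcoxon with its default two-sided setting:

```python
# Sketch: Wilcoxon's signed-rank test on the NB - DT differences from the table above,
# treating the ten folds as if they were results on ten data sets (as the slide does).
from scipy import stats

diff = [-0.0715, -0.1947, 0.0209, -0.2189, -0.1424,
        -0.1739, 0.0992, -0.0371, -0.2802, 0.0341]

# wilcoxon ranks |diff|, sums the ranks of positive and negative differences separately,
# and (for the default two-sided test) reports the smaller of the two sums as its statistic.
res = stats.wilcoxon(diff)
print(f"W = {res.statistic}, p = {res.pvalue:.4f}")  # W should be 8; W <= 8 is significant at alpha = 0.05 for n = 10
```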
Multiple algorithms over multiple data sets
- the Friedman test tells us whether the average ranks as a whole display significant differences; if so, further analysis is needed at the pairwise level
- this is achieved by applying a post-hoc test once the Friedman test signals significance: the Nemenyi test
- a variant of the Nemenyi test, the Bonferroni–Dunn test, can be applied when we perform pairwise tests only against a control algorithm

Sampling
- holdout – split the data randomly into learning and test data, e.g. 2/3 vs. 1/3
- stratified sampling – preserve the relative frequencies of the classes in the samples
- random (sub)sampling – the holdout method repeated k times; the overall accuracy estimate is the average of the accuracies obtained in the individual iterations
- bootstrapping
- undersampling/oversampling of a class – for handling imbalanced data
(A small NumPy sketch of stratified holdout and bootstrap sampling follows below.)
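A minimal sketch of two of these schemes, a stratified 2/3 : 1/3 holdout split and a bootstrap sample with its out-of-bag remainder, using plain NumPy; the toy label array is an assumption made purely for illustration.

```python
# Sketch: stratified holdout and bootstrap sampling with plain NumPy.
# The label array y is illustrative; the index arrays are what a learner would be given.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 30)           # imbalanced toy labels (2:1)

# Stratified holdout: keep the class ratio in both the 2/3 learning and 1/3 test parts.
train_idx, test_idx = [], []
for c in np.unique(y):
    idx = rng.permutation(np.flatnonzero(y == c))
    cut = int(len(idx) * 2 / 3)
    train_idx.extend(idx[:cut])
    test_idx.extend(idx[cut:])

# Bootstrap: draw n instances with replacement; the unused ones form the out-of-bag test set.
n = len(y)
boot = rng.integers(0, n, size=n)           # indices drawn with replacement
oob = np.setdiff1d(np.arange(n), boot)      # roughly 36.8% of the instances on average

print(len(train_idx), len(test_idx), len(np.unique(boot)), len(oob))
```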