Model evaluation
- qualitative: following the definition of data mining (Piatetsky-Shapiro, Fayyad, 1990s): how new, interesting, useful and understandable the model is; whether it does (or does not) correspond to expectations (common sense) and to the knowledge of an expert
- quantitative

Evaluation for different machine learning tasks
- clustering: are the number of clusters and the structure appropriate?
- associations: which rule is interesting?
- outlier detection: top N outliers
- classification and regression

Classification
Training set -> Learning algorithm -> Model/Hypothesis/Classifier; input attributes of a test instance -> predicted class label
- accuracy: how often the model returns the correct class label
- speed: of learning and of testing
- robustness: making correct predictions given noisy data or data with missing values
- scalability: efficiency for large amounts of data

Classification
- main criterion: how successful the model is on data
- a principal decision: what data to use for the most accurate estimate of model accuracy
- most common (but correct?) choices: learning data, test set, cross-validation, leave-one-out
- Is there any other possibility, maybe better?
  bootstrapping, splitting data into disjoint parts, ...

Confusion matrix
- TP, TN, FP, FN: the numbers of true positive, true negative, false positive and false negative predictions
- P, N: the numbers (cardinalities) of positive and negative examples

Evaluation measures
- (overall) accuracy: Acc = (TP + TN) / (TP + TN + FP + FN)
- error rate (misclassification rate): Err = (wFP * FP + wFN * FN) / (TP + TN + FP + FN), where wFP, wFN are the weights of FP and FN errors; with the default weights wFP = wFN = 1, Err = 1 - Acc
- precision: TP / (TP + FP)
- sensitivity (true positive rate, recall): TP / (TP + FN)
- specificity (true negative rate): TN / (TN + FP)

Evaluation measures
- accuracy for a single class (computed over P or N separately)
- F-measures combine precision and recall
- F, F1, F-score: the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall)
- F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), where beta is a non-negative real number

Evaluation measures for comparing classifiers
- learning curve: accuracy as a function of the number of iterations
- ROC curve: relation between the true positive rate and the false positive rate

Sampling
- holdout: split data randomly into learning and test sets, e.g. 2/3 vs. 1/3
- stratified sampling: preserve the relative frequency of classes in the samples
- random (sub)sampling: the holdout method repeated k times; the overall accuracy estimate is the average of the accuracies obtained in the iterations
- bootstrapping
- undersampling/oversampling of a class: for handling imbalanced data
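The confusion-matrix measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the course materials; the function names are our own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

def measures(tp, tn, fp, fn, w_fp=1.0, w_fn=1.0):
    """Overall measures from the confusion matrix; default weights give Err = 1 - Acc."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (w_fp * fp + w_fn * fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # sensitivity, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(measures(tp, tn, fp, fn))  # accuracy 0.75, precision 0.75, recall 0.75
```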
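The F-beta formula can likewise be written directly from the slide; beta = 1 recovers F1, the harmonic mean, while beta > 1 weights recall more heavily. A sketch under those definitions:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.75, 0.75))        # F1 equals precision = recall here: 0.75
print(f_beta(0.5, 1.0, beta=2))  # beta = 2 favours recall
```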
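The sampling schemes above (holdout, k-fold cross-validation with leave-one-out as the special case k = n, and bootstrapping) can be sketched as follows. This is an illustrative skeleton, assuming a caller-supplied `evaluate(learning, test)` function that trains on the learning part and returns an accuracy; no such function appears in the original text.

```python
import random

def holdout_split(data, test_fraction=1 / 3, seed=0):
    """Random holdout split, e.g. 2/3 learning vs. 1/3 test."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (learning, test)

def cross_validation(data, k, evaluate):
    """k-fold cross-validation: average accuracy over the k folds.

    k = len(data) gives leave-one-out.
    """
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        learning = [x for j, fold in enumerate(folds) if j != i for x in fold]
        accuracies.append(evaluate(learning, test))
    return sum(accuracies) / k

def bootstrap_sample(data, seed=0):
    """Bootstrap: draw len(data) items with replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]
```

Random subsampling is then simply `holdout_split` called k times with different seeds, averaging the resulting accuracies.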