PA196: Pattern Recognition
04. Classifier performance: parameters, estimation and comparison
Dr. Vlad Popovici, popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Specific bibliography
• W.J. Krzanowski, D.J. Hand: ROC Curves for Continuous Data. CRC Press, 2009
• M.S. Pepe: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Univ Press, 2003
• N. Japkowicz, M. Shah: Evaluating Learning Algorithms: A Classification Perspective. Cambridge Univ Press, 2011

Outline
1 Performance parameters: introduction; performance parameters for binary classifiers; performance parameters for continuous outputs
2 Performance estimation
3 Performance comparison
4 An example

Context
• binary classifiers: $Y(x) \in \{0, 1\}$ is the predicted class label
• $Y$ is usually obtained from some discriminant function $h(x) \in \mathbb{R}$: $Y = I[h(x) \ge \theta]$
• $h(x)$ (be it a margin, a posterior probability, etc.) can be interpreted as a score
• let $C$ be the true label (0 or 1): the gold standard or ground truth
• we assume a symmetric loss

In medical applications...
• a classifier is often called a test
• the class of interest usually refers to an abnormal condition (e.g. "diseased")
• a "positive test" indicates that the abnormal condition is predicted
• tests can be:
  • diagnostic: detect the presence of disease
  • prognostic: predict a clinical outcome (e.g. "recurrence" vs "non-recurrence")
  • screening: a test applied to a large population to detect an abnormal condition with low prevalence; it is usually followed by other tests

Confusion matrix

                         Gold standard
                 C = 0                              C = 1
Y = 0   true negative,  $P[Y=0\,|\,C=0]$    false negative, $P[Y=0\,|\,C=1]$    $P[Y=0]$
Y = 1   false positive, $P[Y=1\,|\,C=0]$    true positive,  $P[Y=1\,|\,C=1]$    $P[Y=1]$
                 $P[C=0]$                            $P[C=1]$

Goal: estimate the conditional and marginal probabilities.
• estimation is based on a finite test sample $\{(Y_i, C_i)\mid i = 1, \ldots, n\}$ drawn i.i.d. from the population
• the probabilities are estimated as fractions/proportions computed from the test sample

Confusion matrix based on the test sample:

                 Gold standard
                 C = 0              C = 1
Y = 0       $n^{-}_{\bar C}$     $n^{-}_{C}$      $n^{-}$
Y = 1       $n^{+}_{\bar C}$     $n^{+}_{C}$      $n^{+}$
            $n_{\bar C}$         $n_{C}$          $n$

Here $C$ indicates the "positive class" ($C = 1$) and $\bar C$ the "negative class" ($C = 0$).

Notes on the sampling of the test set — the most frequent ways of selecting the test set are:
• i.i.d. from the underlying distribution → it also approximates well the class priors (prevalence); in clinical studies this is called a "cohort study"
• "case-control": a fixed number of positive (cases) and negative (controls) samples are randomly selected from the population → the class priors are not respected
In the following, i.i.d. sampling is implied, unless stated otherwise.
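To make the counting concrete, here is a minimal sketch (Python with NumPy; an illustration assumed for these notes, not part of the original slides) of how the four cells of the sample confusion matrix can be tabulated from the predicted and true labels of an i.i.d. test sample.

```python
import numpy as np

def confusion_counts(y_pred, c_true):
    """Tabulate the confusion-matrix cells n^{+/-}_{C / Cbar} from 0/1 labels."""
    y_pred = np.asarray(y_pred, dtype=int)
    c_true = np.asarray(c_true, dtype=int)
    return {
        "n_neg_Cbar": int(np.sum((y_pred == 0) & (c_true == 0))),  # true negatives
        "n_neg_C":    int(np.sum((y_pred == 0) & (c_true == 1))),  # false negatives
        "n_pos_Cbar": int(np.sum((y_pred == 1) & (c_true == 0))),  # false positives
        "n_pos_C":    int(np.sum((y_pred == 1) & (c_true == 1))),  # true positives
    }

# tiny illustration with made-up labels
counts = confusion_counts([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(counts)  # {'n_neg_Cbar': 2, 'n_neg_C': 0, 'n_pos_Cbar': 1, 'n_pos_C': 2}
```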
Performance parameters for binary classifiers

Basic performance parameters
• a performance parameter $P$ has to be estimated from data, and its estimate $\hat P$ is a random variable
• however, to simplify notation we will write $P$ even when referring to its estimate — the meaning is clear from the context
• error rate, i.e. the proportion of misclassified samples: $\mathrm{Err} = P[Y \ne C] \;\to\; \dfrac{n^{+}_{\bar C} + n^{-}_{C}}{n}$
• false positive fraction: $\mathrm{FPF} = P[Y=1\,|\,C=0] \;\to\; \dfrac{n^{+}_{\bar C}}{n^{+}_{\bar C} + n^{-}_{\bar C}}$ (aka 1 − Specificity (Sp))
• true positive fraction: $\mathrm{TPF} = P[Y=1\,|\,C=1] \;\to\; \dfrac{n^{+}_{C}}{n^{+}_{C} + n^{-}_{C}}$ (aka Sensitivity (Se))
• let $\rho = P[C=1]$ be the prevalence of the positive cases; then $\mathrm{Err} = \rho\,(1-\mathrm{TPF}) + (1-\rho)\,\mathrm{FPF}$
• Positive Predictive Value: $\mathrm{PPV} = P[C=1\,|\,Y=1] \;\to\; \dfrac{n^{+}_{C}}{n^{+}_{C} + n^{+}_{\bar C}}$
• Negative Predictive Value: $\mathrm{NPV} = P[C=0\,|\,Y=0] \;\to\; \dfrac{n^{-}_{\bar C}}{n^{-}_{\bar C} + n^{-}_{C}}$
• perfect classifier/test: PPV = NPV = 1
• totally uninformative classifier/test: PPV = $\rho$, NPV = $1 - \rho$
• in terms of prevalence:
  $\mathrm{PPV} = \dfrac{\rho\,\mathrm{TPF}}{\rho\,\mathrm{TPF} + (1-\rho)\,\mathrm{FPF}}, \qquad \mathrm{NPV} = \dfrac{(1-\rho)(1-\mathrm{FPF})}{(1-\rho)(1-\mathrm{FPF}) + \rho\,(1-\mathrm{TPF})}$
• in information-retrieval applications, recall stands for TPF and precision for PPV
• $F$-measure (a weighted harmonic mean of precision and recall): $F_\alpha = \dfrac{\mathrm{precision} \times \mathrm{recall}}{\alpha\,\mathrm{precision} + (1-\alpha)\,\mathrm{recall}}$
• Matthews correlation coefficient:
  $\mathrm{MCC} = \dfrac{n^{+}_{C}\, n^{-}_{\bar C} - n^{+}_{\bar C}\, n^{-}_{C}}{\sqrt{(n^{+}_{C}+n^{+}_{\bar C})(n^{+}_{C}+n^{-}_{C})(n^{-}_{\bar C}+n^{+}_{\bar C})(n^{-}_{\bar C}+n^{-}_{C})}}$

Correcting for chance...
• Example 1: let the prevalence of positive cases be $\rho = 0.75$ and consider a classifier that predicts "1" or "0" with equal probabilities (a coin flip); simply by chance, the classifier will correctly label positive cases in a proportion $0.5 \times 0.75 = 0.375$ of the samples
• Example 2: in medical imaging the true labels may not be known, but there is an expert producing one labelling and the classifier producing another; how can we compare the two while taking into account the concordances due to mere chance?

...using agreement statistics
• probability of observed agreement between classifier and the true labels: $P_o = \dfrac{n^{-}_{\bar C} + n^{+}_{C}}{n}$
• the $S$-coefficient is defined as $S = 2P_o - 1$
• taking chance agreement $P_e$ into account: what is the ratio between the agreement beyond chance ($P_o - P_e$) and the maximum possible agreement beyond chance: $\dfrac{P_o - P_e}{1 - P_e}$
• if the chance agreement is estimated as $P_e = \left(\dfrac{n^{+} + n_{C}}{2n}\right)^{2} + \left(\dfrac{n^{-} + n_{\bar C}}{2n}\right)^{2}$, the fraction $\pi = \dfrac{P_o - P_e}{1 - P_e}$ is called Scott's $\pi$ coefficient
• if the chance agreement is estimated as $P_e = \dfrac{n^{+}\, n_{C}}{n^2} + \dfrac{n^{-}\, n_{\bar C}}{n^2}$, the fraction $\kappa = \dfrac{P_o - P_e}{1 - P_e}$ is called Cohen's kappa coefficient
• in medical applications, $\kappa$ is usually used for measuring the agreement between an expert and another system
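The point estimates above follow directly from the confusion-matrix counts. A minimal sketch (Python, standard library only; an illustration, not part of the original slides) computing sensitivity, specificity, predictive values, Cohen's kappa and MCC:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Point estimates from confusion-matrix counts.
    tp = n^+_C, fp = n^+_Cbar, fn = n^-_C, tn = n^-_Cbar."""
    n = tp + fp + fn + tn
    tpf = tp / (tp + fn)            # sensitivity (TPF)
    fpf = fp / (fp + tn)            # 1 - specificity (FPF)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    err = (fp + fn) / n
    # Cohen's kappa: observed vs. chance agreement computed from the marginals
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(TPF=tpf, FPF=fpf, PPV=ppv, NPV=npv, Err=err, kappa=kappa, MCC=mcc)

# counts of the pCR example used later in the slides: tp=10, fp=24, fn=5, tn=61
print(binary_metrics(tp=10, fp=24, fn=5, tn=61))
# TPF ~ 0.67, FPF ~ 0.28, PPV ~ 0.29, NPV ~ 0.92, in line with the slide values
```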
Confidence intervals (CI)
• we need ways of characterizing the uncertainty in the estimates
• informally, a CI is a measure of the reliability of an estimate; it is sample-based (observed)
• confidence level: how often, over repeated sampling, the confidence interval contains the true value of the parameter
• the values within the CI can be seen as alternative plausible estimates of the parameter of interest
• the smaller the sample size, the larger the CI
• the estimates of (TPF, FPF) and (PPV, NPV) are random variables — proportions of successes in Bernoulli trials:
  • Bernoulli trial: an experiment with a random binary outcome
  • binomial distribution: the discrete distribution of the number of successes in $n$ independent Bernoulli trials with success probability $p$:
    $X \sim B(n, p): \quad P[X = k] = \binom{n}{k} p^k (1-p)^{n-k}, \qquad E[X] = np, \qquad \mathrm{Var}[X] = np(1-p)$
  • as $n \to \infty$, $\dfrac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1)$
• simplest CI — normal approximation: a $1-\alpha$ CI for the binomial parameter $p$ (the proportion of successes in $n$ trials) is
  $p \pm z_{1-\alpha/2}\, \sqrt{\dfrac{p(1-p)}{n}}$
  where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution (e.g. for a 95% CI, $\alpha = 0.05$ and $z_{0.975} \approx 1.96$)

Warning: the normal approximation is poor for FPF or TPF close to 0 or 1.

Agresti–Coull $1-\alpha$ CI:
• let $n$ be the number of trials and $X$ the number of successes; define $\tilde n = n + z^2_{1-\alpha/2}$ and $\tilde p = \dfrac{1}{\tilde n}\left(X + \tfrac{1}{2} z^2_{1-\alpha/2}\right)$; then a good approximation of the CI is
  $\tilde p \pm z_{1-\alpha/2}\, \sqrt{\dfrac{\tilde p (1 - \tilde p)}{\tilde n}}$
• other formulas for CIs: Wilson score interval, Clopper–Pearson interval, Bayesian CIs (a small code sketch of the normal-approximation and Agresti–Coull intervals is given after this subsection)

Example: a test for predicting pCR in breast cancer yields

              pCR = 0    pCR = 1
predicted 0      61          5
predicted 1      24         10

TPF = 0.67, FPF = 0.28, PPV = 0.29, NPV = 0.92
95% confidence intervals:
• normal approx.: FPF ∈ (0.197, 0.391), TPF ∈ (0.428, 0.905)
• Wilson: FPF ∈ (0.208, 0.398), TPF ∈ (0.417, 0.848)
• Bayesian: FPF ∈ (0.205, 0.397), TPF ∈ (0.416, 0.860)

Joint confidence intervals
[figure: univariate confidence intervals for FPF and TPF and the induced rectangular region in the (FPF, TPF) plane]
• what is the joint $100(1-\alpha)\%$ confidence region for (FPF, TPF)?

Rectangular confidence regions
If $(P_{low}, P_{up})$ and $(Q_{low}, Q_{up})$ are the $1-\alpha^{*}$ univariate confidence intervals for two independent binomial random variables $P$ and $Q$, then the rectangle $R \equiv (P_{low}, P_{up}) \times (Q_{low}, Q_{up})$ is a $(1-\alpha)$ confidence region for $(P, Q)$, where $\alpha = 1 - (1-\alpha^{*})^{2}$.
Examples:
• 95% univariate CIs lead to a 90.25% confidence region
• for a 95% confidence region, 97.5% univariate CIs are needed
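A minimal sketch of the normal-approximation and Agresti–Coull intervals (Python with NumPy/SciPy assumed; an illustration, not the slides' own computation), applied to the TPF of the pCR example, plus the rectangular-region calculation:

```python
import numpy as np
from scipy.stats import norm

def normal_ci(successes, n, alpha=0.05):
    """Normal-approximation (Wald) CI for a binomial proportion."""
    z = norm.ppf(1 - alpha / 2)
    p = successes / n
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def agresti_coull_ci(successes, n, alpha=0.05):
    """Agresti-Coull CI: add z^2/2 pseudo-successes and z^2/2 pseudo-failures."""
    z = norm.ppf(1 - alpha / 2)
    n_t = n + z**2
    p_t = (successes + 0.5 * z**2) / n_t
    half = z * np.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

# TPF in the pCR example: 10 true positives out of 15 diseased cases
print(normal_ci(10, 15))         # about (0.428, 0.905), matching the slide
print(agresti_coull_ci(10, 15))  # close to the Wilson interval quoted above

# joint rectangular region: two 95% univariate CIs give a (1 - alpha) region with
alpha_joint = 1 - (1 - 0.05) ** 2
print(alpha_joint)               # 0.0975 -> a 90.25% confidence region
```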
Performance parameters for continuous outputs

A motivating example
Using (FPF, TPF) for comparing tests:
[figure: the (FPF, TPF) plane partitioned into regions A–D around the operating point of the current test]
• a single-point performance measure partitions the space into 4 regions
• region A: better than the current test
• region D: worse than the current test
• regions B, C: less clear
Other issues with single-point performance metrics:
• difficulty in selecting the optimal threshold: different contexts may need different operating regimes
• additive batch effects may spoil the single-point performance

ROC curves: theory
[figure: score distributions S(X) of the negative and positive classes with threshold t, and the corresponding curve in the (FPF, TPF) plane]
• continuous test score $Y = S(x)$ (could be the margin $h(x)$)
• $\mathrm{FPF}(t) = P[Y \ge t \mid C=0]$
• $\mathrm{TPF}(t) = P[Y \ge t \mid C=1]$
• $\mathrm{ROC} = \{(\mathrm{FPF}(t), \mathrm{TPF}(t)) \mid t \in \mathbb{R}\}$
• $\lim_{t\to\infty} \mathrm{FPF}(t) = \lim_{t\to\infty} \mathrm{TPF}(t) = 0$
• $\lim_{t\to-\infty} \mathrm{FPF}(t) = \lim_{t\to-\infty} \mathrm{TPF}(t) = 1$

Properties of the ROC curve:
• it is a monotone increasing function
• the ROC curve is invariant to strictly increasing transformations of the scores, $Y = \psi(S(x))$
• parametric model: $\mathrm{ROC} = \{(\alpha, \mathrm{TPF}(\mathrm{FPF}^{-1}(\alpha))) \mid \alpha \in (0,1)\}$
• $\mathrm{ROC}(0) = 0$, $\mathrm{ROC}(1) = 1$, and
  $\dfrac{\partial\, \mathrm{ROC}(t)}{\partial t} = \dfrac{f_C(\mathrm{FPF}^{-1}(t))}{f_{\bar C}(\mathrm{FPF}^{-1}(t))}$,
  where $f_C$ and $f_{\bar C}$ are the probability densities of the scores within the diseased and healthy populations, respectively
• the ROC curve describes the relationship between the two score distributions, independently of the actual scale of the scores

Note that
$\dfrac{\partial\, \mathrm{ROC}(t)}{\partial t} = \dfrac{P[Y = t \mid C=1]}{P[Y = t \mid C=0]} = \mathrm{LR}(t)$,
the likelihood ratio at threshold $t$.
• if LR is monotonically increasing, then a classification rule of the form LR > t is optimal
• the ROC curve based on LR lies uniformly above all other ROC curves
• the optimal ROC curve is concave ⇒ its slope is a monotone decreasing function

Summary indices
Area under the ROC curve (AUC): $\mathrm{AUC} = \int_0^1 \mathrm{ROC}(t)\,dt$
[figure: the AUC as the area under the ROC curve]
Properties:
• $0.5 \le \mathrm{AUC} \le 1$
• $\mathrm{AUC} = P[Y_C > Y_{\bar C}]$ → the probability of correctly ordering a random pair of cases (Mann–Whitney–Wilcoxon U-statistic)
• $\mathrm{AUC} = \int_0^1 \mathrm{TPF}(\mathrm{FPF}^{-1}(t))\,dt = -\int_{-\infty}^{\infty} \mathrm{TPF}(t)\, d\,\mathrm{FPF}(t)$

The binormal model: $\mathrm{ROC}(t) = \Phi(\alpha + \beta\,\Phi^{-1}(t))$, where $\beta > 0$ and $\Phi$ is the standard normal CDF.
Properties:
• $\mathrm{AUC} = \Phi\!\left(\dfrac{\alpha}{\sqrt{1+\beta^{2}}}\right)$
• binormal assumption: there exists some monotone strictly increasing function $h(\cdot)$ which makes $Y_C$ and $Y_{\bar C}$ normally distributed
• if the ROC is binormal, $\mathrm{ROC}(t) = \Phi(\alpha + \beta\Phi^{-1}(t))$, then $h(s) = -\Phi^{-1}(\mathrm{FPF}(s))$ transforms the scores $Y_C$ and $Y_{\bar C}$ into normally distributed random variables
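A small numerical sketch of the binormal model (Python with NumPy/SciPy assumed; the parameter values are illustrative): it builds $\mathrm{ROC}(t) = \Phi(\alpha + \beta\Phi^{-1}(t))$ and checks the closed-form $\mathrm{AUC} = \Phi(\alpha/\sqrt{1+\beta^2})$ against numerical integration of the curve.

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 1.2, 0.8          # illustrative binormal parameters (beta > 0)

def binormal_roc(t):
    """ROC(t) = Phi(alpha + beta * Phi^{-1}(t)) for t in (0, 1)."""
    return norm.cdf(alpha + beta * norm.ppf(t))

# closed-form AUC for the binormal model
auc_closed = norm.cdf(alpha / np.sqrt(1 + beta**2))

# trapezoidal integration of the ROC curve over FPF in (0, 1)
t = np.linspace(1e-6, 1 - 1e-6, 20000)
y = binormal_roc(t)
auc_numeric = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))

print(auc_closed, auc_numeric)   # the two values agree to several decimals
```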
Empirical estimates of the ROC
$\mathrm{ROC}_e(t) = \widehat{\mathrm{TPF}}(\widehat{\mathrm{FPF}}^{-1}(t))$, with
$\widehat{\mathrm{TPF}}(t) = \dfrac{1}{n_C}\sum_{i=1}^{n_C} I[Y_{Ci} \ge t], \qquad \widehat{\mathrm{FPF}}(t) = \dfrac{1}{n_{\bar C}}\sum_{i=1}^{n_{\bar C}} I[Y_{\bar C i} \ge t]$
[figure: empirical ROC curve (true positive rate vs. false positive rate); the "ROC" of a single-threshold classifier reduces to a single point]

Empirical estimate of the AUC
Mann–Whitney–Wilcoxon U-statistic:
$\mathrm{AUC}_e = \dfrac{1}{n_C n_{\bar C}} \sum_{i=1}^{n_C} \sum_{j=1}^{n_{\bar C}} \left( I[Y_{Ci} > Y_{\bar C j}] + 0.5\, I[Y_{Ci} = Y_{\bar C j}] \right)$
Note: if only one point in the (FPF, TPF) space is given, $\mathrm{AUC} = 0.5\,(1 + \mathrm{TPF} - \mathrm{FPF})$.

AUC: sampling variability
$\mathrm{Var}(\mathrm{AUC}_e) = \dfrac{1}{n_C n_{\bar C}} \left[ \mathrm{AUC}(1-\mathrm{AUC}) + (n_C - 1)(Q_1 - \mathrm{AUC}^2) + (n_{\bar C} - 1)(Q_2 - \mathrm{AUC}^2) \right]$
where
$Q_1 = P[Y_{Ci} \ge Y_{\bar C j},\; Y_{Ck} \ge Y_{\bar C j}], \qquad Q_2 = P[Y_{Ci} \ge Y_{\bar C j},\; Y_{Ci} \ge Y_{\bar C k}]$

Semi-parametric models
Start from $\mathrm{ROC}(t) = \mathrm{TPF}(\mathrm{FPF}^{-1}(t \mid \alpha) \mid \beta)$ and assume some parametric form for TPF and FPF, whose parameters are estimated from the data.
Example of a semi-parametric model:
$Y_{Ci} = \mu_C + \sigma_C\, \varepsilon_i, \qquad Y_{\bar C i} = \mu_{\bar C} + \sigma_{\bar C}\, \varepsilon_i$
where the $\varepsilon$ have mean 0 and variance 1 and follow some distribution, with survivor function $S$ estimated from the standardized residuals:
$S(t) = \dfrac{1}{n_C + n_{\bar C}} \left[ \sum_i I\!\left[\dfrac{Y_{Ci} - \mu_C}{\sigma_C} \ge t\right] + \sum_i I\!\left[\dfrac{Y_{\bar C i} - \mu_{\bar C}}{\sigma_{\bar C}} \ge t\right] \right]$
which leads to
$\mathrm{ROC}(t) = S\!\left( \dfrac{\mu_{\bar C} - \mu_C}{\sigma_C} + \dfrac{\sigma_{\bar C}}{\sigma_C}\, S^{-1}(t) \right)$

Ex: empirical vs. semi-parametric estimation
[figure: score densities for the non-diseased and diseased groups; semi-parametric vs. empirical ROC estimates]
$\mathrm{AUC}_e \approx 0.7475$; $\mathrm{AUC}_{sp} \approx 0.7418$

Performance estimation

Why estimation?
• finite training data
• no formula for CIs without distributional assumptions
• often, a single data set is available for both model building and performance measurement
• performance estimated on the modeling data is optimistically biased

Idea: split (maybe repeatedly) the available data into a training and a validation set, and assess the performance only on data that has not been used in building the model.

WARNING: all the processing steps that depend on the sampling and which lead to the final model MUST BE REPEATED IDENTICALLY ON EVERY TRAIN–VALIDATION SPLIT! This includes, but is not limited to: data normalization, feature selection, classifier training, meta-parameter optimization.

Notes:
• any two training sets generated from the full data set by resampling will usually overlap to some extent → the resulting models are not totally independent
• consequently, the variability is usually under-estimated
• the procedure is easy to parallelize, but attention must be paid to the parallel RNG (to avoid repeating the same random sequences)

Resampling methods
• simple split-sample approach
• k-fold cross-validation
• Monte-Carlo cross-validation
• repeated k-fold cross-validation
• leave-one-out
• bootstrapping
• ...

k-fold cross-validation
• separate train and test sets
• randomly divide the data into k subsets (folds) — you may also choose to enforce the class proportions in each fold (stratified CV)
• train on k − 1 folds and test on the held-out fold
• estimate the error as the average error measured on the held-out folds
[figure: the data split into k folds; in each round one fold is the test set and the remaining folds form the train set]
• usually k = 5 or k = 10
• if k = n ⇒ the leave-one-out estimator
• improved estimation: repeated k-fold CV (e.g. 100 × 5-CV)
• from the k folds we obtain k error estimates $\varepsilon_1, \ldots, \varepsilon_k$ on the test folds (or estimates of any other performance parameter)
• $\hat E_{k\text{-CV}} = \dfrac{1}{k} \sum_{j=1}^{k} \varepsilon_j$, together with an estimated standard deviation
• confidence intervals (simple version — normal approximation):
  $E \approx \hat E \pm \left( \dfrac{0.5}{n} + z \sqrt{\dfrac{\hat E (1 - \hat E)}{n}} \right)$
  where $n$ is the dataset size and $z = \Phi^{-1}(1 - \alpha/2)$ for a $1-\alpha$ confidence interval (e.g. $z = 1.96$ for a 95% confidence interval)
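A minimal sketch of stratified k-fold CV (assuming scikit-learn; the synthetic data, the scaler, the univariate feature selection and the logistic regression are placeholder choices, not the lecture's prescribed pipeline). The point it illustrates is the WARNING above: every sampling-dependent step is refit inside each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # normalization, feature selection and classifier are ALL refit on the training fold only
    model = make_pipeline(StandardScaler(),
                          SelectKBest(f_classif, k=10),
                          LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"5-fold CV AUC: {np.mean(aucs):.3f} +/- {np.std(aucs, ddof=1):.3f}")
```

Repeating the loop over several random fold assignments (repeated k-fold CV) gives a more stable estimate of both the mean performance and its spread.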
Bootstrap error estimation
[figure: B bootstrap datasets $(X_1,Y_1), \ldots, (X_B,Y_B)$ resampled from $(X,Y)$, each yielding an error estimate $E_b$]
1. generate a new dataset $(X_b, Y_b)$ by resampling with replacement from the original dataset $(X, Y)$
2. train the classifier on $(X_b, Y_b)$ and test it on the left-out data (the samples not drawn), to obtain an error estimate $\hat E_b$
3. repeat steps 1–2 for $b = 1, \ldots, B$ and collect the $\hat E_b$

• estimate the error, for example with the .632 estimator:
  $\hat E = 0.368\, E_0 + 0.632\; \dfrac{1}{B} \sum_{b=1}^{B} \hat E_b$
  where $E_0$ is the error rate on the full training set $(X, Y)$
• use the empirical distribution of the $\hat E_b$ to obtain confidence intervals
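A minimal sketch of the bootstrap/.632 procedure (scikit-learn and NumPy assumed; the data and classifier are again placeholders): each bootstrap model is evaluated on its out-of-bag samples, and the .632 estimator blends the out-of-bag error with the resubstitution error $E_0$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=20, random_state=1)
rng = np.random.default_rng(1)
B, n = 200, len(y)

oob_errors = []
for b in range(B):
    boot = rng.integers(0, n, size=n)              # resample with replacement
    oob = np.setdiff1d(np.arange(n), boot)          # left-out (out-of-bag) samples
    if oob.size == 0:
        continue
    clf = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))

E0 = np.mean(LogisticRegression(max_iter=1000).fit(X, y).predict(X) != y)  # resubstitution error
E_632 = 0.368 * E0 + 0.632 * np.mean(oob_errors)

# percentile interval from the empirical distribution of the bootstrap errors
ci_low, ci_high = np.percentile(oob_errors, [2.5, 97.5])
print(f".632 error: {E_632:.3f}; 95% percentile interval of E_b: ({ci_low:.3f}, {ci_high:.3f})")
```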
LPO bootstrap
Classification rule: predict $C$ if $\hat h(x) \ge \theta$ and $\bar C$ otherwise, where $\hat h$ is the estimated log-likelihood ratio and $C$, $\bar C$ are the class labels.
The empirical AUC (conditioned on the training set) can be approximated by
$\mathrm{AUC} = \dfrac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} \psi\!\left( \hat h(x_i \mid C),\; \hat h(x_j \mid \bar C) \right)$
where $\psi$ is the Mann–Whitney kernel,
$\psi(a, b) = \begin{cases} 1 & a > b \\ \tfrac{1}{2} & a = b \\ 0 & a < b \end{cases}$
Yousef et al., Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier, PRL 2005

Estimation of the expected AUC by the LPO bootstrap:
$\overline{\mathrm{AUC}}_{LPO} = \dfrac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} \overline{\mathrm{AUC}}_{i,j}, \qquad \overline{\mathrm{AUC}}_{i,j} = \dfrac{\sum_{b=1}^{B} I^b_j\, I^b_i\, \psi(\hat h_b(x_i), \hat h_b(x_j))}{\sum_{b=1}^{B} I^b_j\, I^b_i}$
(here $I^b_i = 1$ when $x_i$ is not drawn into the $b$-th bootstrap training set, so each pair is scored only by classifiers that were not trained on it)

When two independent data sets are available, one can estimate:
• the expected value of the conditional AUC: the expectation over the population of training sets of the same size;
• the variability of the performance estimate due to the finite training set;
• the variability of the performance estimate due to the finite validation sets.
Yousef et al., Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach, PAMI 2006

Conclusions
What we do learn from CV (and related methods):
• the expected performance of the modeling recipe;
• the imprecision in estimating that performance;
• we can also have a look at:
  • which features are the most stable
  • which points are always misclassified
What we do not learn from CV:
• the best features
• the best classifier
• the best meta-parameters
We obtain these by training on the full dataset (no CV).

Performance comparison

General considerations:
• are we comparing methods/algorithms or models?
• let there be two models M1 and M2 and a performance parameter P
• what differences are relevant?
• proper planning of the experimental design is needed
• hypothesis testing (equivalence/difference and inferiority/superiority):
  H0: there is no difference in performance, P(M1) = P(M2)
  H1: P(M1) ≠ P(M2) (two-sided test), or H1: P(M1) < P(M2) (or >) (one-sided, i.e. inferiority/superiority test)
• informally, one can check the overlap between CIs
• ideally, one would have a very large test set for the comparison

In everyday applications...
• one has limited data → use resampling (e.g. cross-validation) for testing
• let $P_{11}, \ldots, P_{1K}$ be the performance of the 1st model on the K test sets and $P_{21}, \ldots, P_{2K}$ the performance of the 2nd model on the same K test sets
• simple tests: paired t-test and Wilcoxon signed-rank test
• warning: the variability is underestimated, hence the t-test has an inflated Type I error; there is a "corrected t-test" that alleviates the problem
• note that the two samples $\{P_{1\cdot}\}$ and $\{P_{2\cdot}\}$ are not independent!

McNemar's test
• consider a single test set of size m, on which both models are applied
• the following contingency table is constructed:

                        Model M2
                      0        1
Model M1      0     c00      c01
              1     c10      c11

with
• c00 counting how many times both models misclassified the same sample
• c11 counting how many times both models correctly classified the same sample
• c10 and c01 counting how many times M1 correctly classified a sample that M2 misclassified, and vice-versa
• McNemar's test: H0 — both classifiers have the same performance (same error rates)
• construct the test statistic $\chi^2_{Mc} = \dfrac{(|c_{01} - c_{10}| - 1)^2}{c_{01} + c_{10}}$
• $\chi^2_{Mc}$ has an approximate $\chi^2$ distribution with 1 degree of freedom
• $\chi^2_{Mc}$ is compared with the $\chi^2_{1,\,1-\alpha}$ critical value (confidence level $1-\alpha$)
• rule of thumb: the test needs a sample size large enough that $c_{01} + c_{10}$ is at least 30

Wrap-up
• there are many performance parameters; the choice depends on the intended usage
• performance estimation is a key step of the classifier building process
• pay attention to the proper application of resampling methods for performance estimation
• always (ALWAYS!) report the uncertainty in the estimates
• classifier performance comparison depends, again, on the intended application
• McNemar's test and CIs provide indications of performance differences

An example

MAQC-II:
• ∼300 participants, from 5 countries:
  • data providers
  • data analysis teams (DATs): 36 teams (∼100 people)
  • regulatory board (mainly FDA)
• 6 datasets, 13 endpoints
• > 30000 "models"
• each Data Analysis Plan (DAP) is peer-reviewed
• each DAT selects a single candidate model for each endpoint
• the MAQC-II consortium selects 2 models for each endpoint, before the release of the validation sets

DAP
Constraints:
• should be generally applicable, independent of dataset/endpoint
• trade-off: understandability/reproducibility vs. performance/complexity
• the models should make single-chip predictions
Solution:
• use MAS5 for normalization
• favor "simple" classifiers
• nested 10 × 5-CV
• use AUC as the main performance criterion
[figure: DAP workflow — quality control, MAS5 normalization, feature selection/ranking, classifier design, model optimization, CV-based performance estimation, and testing on external data]
• classifiers: DLDA, LDA, k-NN, CART, logistic regression
• meta-parameters: number of features, k, ...
• inner CV: optimize the meta-parameters
• outer CV: estimate the performance of the whole system (a sketch of this nested scheme is given at the end of the section)

Some performance results
[figure: AUC estimated by 10×5-CV and validation AUC, per endpoint A–M]
Estimation bias
[figure: CV minus validation AUC (deltaAUC), per endpoint and per organization]
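To close, a minimal sketch of the nested CV scheme described in the DAP (scikit-learn assumed; the synthetic data, the k-NN classifier and the small parameter grid are illustrative placeholders, not the MAQC-II models, and the fold counts do not reproduce the exact 10 × 5-CV of the DAP): the inner CV selects the meta-parameters, the outer CV estimates the performance of the whole recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(f_classif)),
                 ("clf", KNeighborsClassifier())])

# inner CV: optimize the meta-parameters (number of features, k)
param_grid = {"select__k": [5, 10, 20], "clf__n_neighbors": [3, 5, 7]}
inner = GridSearchCV(pipe, param_grid, scoring="roc_auc",
                     cv=StratifiedKFold(5, shuffle=True, random_state=1))

# outer CV: estimate the performance of the whole model-building recipe
outer_auc = cross_val_score(inner, X, y, scoring="roc_auc",
                            cv=StratifiedKFold(10, shuffle=True, random_state=2))
print(f"nested-CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std(ddof=1):.3f}")
```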