PA196: Pattern Recognition
04. Classifier performance: parameters, estimation and comparison
Dr. Vlad Popovici, popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Specific bibliography
• W.J. Krzanowski, D.J. Hand: ROC Curves for Continuous Data. CRC Press, 2009
• M.S. Pepe: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford Univ Press, 2003
• N. Japkowicz, M. Shah: Evaluating Learning Algorithms: A Classification Perspective. Cambridge Univ Press, 2011

Outline
1 Performance parameters: introduction; performance parameters for binary classifiers; performance parameters for continuous outputs
2 Performance estimation
3 Performance comparison
4 An example

Context
• binary classifiers: $Y(x) \in \{0, 1\}$ is the predicted class label
• $Y$ is usually obtained from some discriminant function $h(x) \in \mathbb{R}$: $Y = I[h(x) \ge \theta]$
• $h(x)$ (be it a margin, a posterior probability, etc.) can be interpreted as a score
• let $C$ be the true label (0 or 1): the gold standard or ground truth
• we assume a symmetric loss

In medical applications...
• a classifier is often called a test
• the class of interest usually refers to an abnormal condition (e.g. "diseased")
• a "positive test" indicates that the abnormal condition is predicted
• tests can be:
  • diagnostic: detect the presence of disease
  • prognostic: predict a clinical outcome (e.g. "recurrence" vs "non-recurrence")
  • screening: a test applied to a large population to detect an abnormal condition with low prevalence; it is usually followed by other tests

Confusion matrix

                         Gold standard
                 C = 0                              C = 1
Y = 0   true negative,  $P[Y=0\,|\,C=0]$    false negative, $P[Y=0\,|\,C=1]$    $P[Y=0]$
Y = 1   false positive, $P[Y=1\,|\,C=0]$    true positive,  $P[Y=1\,|\,C=1]$    $P[Y=1]$
                 $P[C=0]$                            $P[C=1]$

Goal: estimate the conditional and marginal probabilities.
• estimation is based on a finite test sample $\{(Y_i, C_i)\mid i = 1, \ldots, n\}$ drawn i.i.d. from the population
• the probabilities are estimated as fractions/proportions computed from the test sample

Confusion matrix based on the test sample:

                 Gold standard
                 C = 0              C = 1
Y = 0       $n^{-}_{\bar C}$     $n^{-}_{C}$      $n^{-}$
Y = 1       $n^{+}_{\bar C}$     $n^{+}_{C}$      $n^{+}$
            $n_{\bar C}$         $n_{C}$          $n$

Here $C$ indicates the "positive class" ($C = 1$) and $\bar C$ the "negative class" ($C = 0$).

Notes on the sampling of the test set — the most frequent ways of selecting the test set are:
• i.i.d. from the underlying distribution → it also approximates well the class priors (prevalence); in clinical studies this is called a "cohort study"
• "case-control": a fixed number of positive (cases) and negative (controls) samples are randomly selected from the population → the class priors are not respected
In the following, i.i.d. sampling is implied, unless stated otherwise.
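To make the counting concrete, here is a minimal sketch (Python with NumPy; an illustration assumed for these notes, not part of the original slides) of how the four cells of the sample confusion matrix can be tabulated from the predicted and true labels of an i.i.d. test sample.

```python
import numpy as np

def confusion_counts(y_pred, c_true):
    """Tabulate the confusion-matrix cells n^{+/-}_{C / Cbar} from 0/1 labels."""
    y_pred = np.asarray(y_pred, dtype=int)
    c_true = np.asarray(c_true, dtype=int)
    return {
        "n_neg_Cbar": int(np.sum((y_pred == 0) & (c_true == 0))),  # true negatives
        "n_neg_C":    int(np.sum((y_pred == 0) & (c_true == 1))),  # false negatives
        "n_pos_Cbar": int(np.sum((y_pred == 1) & (c_true == 0))),  # false positives
        "n_pos_C":    int(np.sum((y_pred == 1) & (c_true == 1))),  # true positives
    }

# tiny illustration with made-up labels
counts = confusion_counts([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(counts)  # {'n_neg_Cbar': 2, 'n_neg_C': 0, 'n_pos_Cbar': 1, 'n_pos_C': 2}
```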
Performance parameters for binary classifiers

Basic performance parameters
• a performance parameter $P$ has to be estimated from data, and its estimate $\hat P$ is a random variable
• however, to simplify notation we will write $P$ even when referring to its estimate — the meaning is clear from the context
• error rate, i.e. the proportion of misclassified samples: $\mathrm{Err} = P[Y \ne C] \;\to\; \dfrac{n^{+}_{\bar C} + n^{-}_{C}}{n}$
• false positive fraction: $\mathrm{FPF} = P[Y=1\,|\,C=0] \;\to\; \dfrac{n^{+}_{\bar C}}{n^{+}_{\bar C} + n^{-}_{\bar C}}$ (aka 1 − Specificity (Sp))
• true positive fraction: $\mathrm{TPF} = P[Y=1\,|\,C=1] \;\to\; \dfrac{n^{+}_{C}}{n^{+}_{C} + n^{-}_{C}}$ (aka Sensitivity (Se))
• let $\rho = P[C=1]$ be the prevalence of the positive cases; then $\mathrm{Err} = \rho\,(1-\mathrm{TPF}) + (1-\rho)\,\mathrm{FPF}$
• Positive Predictive Value: $\mathrm{PPV} = P[C=1\,|\,Y=1] \;\to\; \dfrac{n^{+}_{C}}{n^{+}_{C} + n^{+}_{\bar C}}$
• Negative Predictive Value: $\mathrm{NPV} = P[C=0\,|\,Y=0] \;\to\; \dfrac{n^{-}_{\bar C}}{n^{-}_{\bar C} + n^{-}_{C}}$
• perfect classifier/test: PPV = NPV = 1
• totally uninformative classifier/test: PPV = $\rho$, NPV = $1 - \rho$
• in terms of prevalence:
  $\mathrm{PPV} = \dfrac{\rho\,\mathrm{TPF}}{\rho\,\mathrm{TPF} + (1-\rho)\,\mathrm{FPF}}, \qquad \mathrm{NPV} = \dfrac{(1-\rho)(1-\mathrm{FPF})}{(1-\rho)(1-\mathrm{FPF}) + \rho\,(1-\mathrm{TPF})}$
• in information-retrieval applications, recall stands for TPF and precision for PPV
• $F$-measure (a weighted harmonic mean of precision and recall): $F_\alpha = \dfrac{\mathrm{precision} \times \mathrm{recall}}{\alpha\,\mathrm{precision} + (1-\alpha)\,\mathrm{recall}}$
• Matthews correlation coefficient:
  $\mathrm{MCC} = \dfrac{n^{+}_{C}\, n^{-}_{\bar C} - n^{+}_{\bar C}\, n^{-}_{C}}{\sqrt{(n^{+}_{C}+n^{+}_{\bar C})(n^{+}_{C}+n^{-}_{C})(n^{-}_{\bar C}+n^{+}_{\bar C})(n^{-}_{\bar C}+n^{-}_{C})}}$

Correcting for chance...
• Example 1: let the prevalence of positive cases be $\rho = 0.75$ and consider a classifier that predicts "1" or "0" with equal probabilities (a coin flip); simply by chance, the classifier will correctly label positive cases in a proportion $0.5 \times 0.75 = 0.375$ of the samples
• Example 2: in medical imaging the true labels may not be known, but there is an expert producing one labelling and the classifier producing another; how can we compare the two while taking into account the concordances due to mere chance?

...using agreement statistics
• probability of observed agreement between classifier and the true labels: $P_o = \dfrac{n^{-}_{\bar C} + n^{+}_{C}}{n}$
• the $S$-coefficient is defined as $S = 2P_o - 1$
• taking chance agreement $P_e$ into account: what is the ratio between the agreement beyond chance ($P_o - P_e$) and the maximum possible agreement beyond chance: $\dfrac{P_o - P_e}{1 - P_e}$
• if the chance agreement is estimated as $P_e = \left(\dfrac{n^{+} + n_{C}}{2n}\right)^{2} + \left(\dfrac{n^{-} + n_{\bar C}}{2n}\right)^{2}$, the fraction $\pi = \dfrac{P_o - P_e}{1 - P_e}$ is called Scott's $\pi$ coefficient
• if the chance agreement is estimated as $P_e = \dfrac{n^{+}\, n_{C}}{n^2} + \dfrac{n^{-}\, n_{\bar C}}{n^2}$, the fraction $\kappa = \dfrac{P_o - P_e}{1 - P_e}$ is called Cohen's kappa coefficient
• in medical applications, $\kappa$ is usually used for measuring the agreement between an expert and another system
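The point estimates above follow directly from the confusion-matrix counts. A minimal sketch (Python, standard library only; an illustration, not part of the original slides) computing sensitivity, specificity, predictive values, Cohen's kappa and MCC:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Point estimates from confusion-matrix counts.
    tp = n^+_C, fp = n^+_Cbar, fn = n^-_C, tn = n^-_Cbar."""
    n = tp + fp + fn + tn
    tpf = tp / (tp + fn)            # sensitivity (TPF)
    fpf = fp / (fp + tn)            # 1 - specificity (FPF)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    err = (fp + fn) / n
    # Cohen's kappa: observed vs. chance agreement computed from the marginals
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(TPF=tpf, FPF=fpf, PPV=ppv, NPV=npv, Err=err, kappa=kappa, MCC=mcc)

# counts of the pCR example used later in the slides: tp=10, fp=24, fn=5, tn=61
print(binary_metrics(tp=10, fp=24, fn=5, tn=61))
# TPF ~ 0.67, FPF ~ 0.28, PPV ~ 0.29, NPV ~ 0.92, in line with the slide values
```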
Confidence intervals (CI)
• we need ways of characterizing the uncertainty in the estimates
• informally, a CI is a measure of the reliability of an estimate; it is sample-based (observed)
• confidence level: how often, over repeated sampling, the confidence interval contains the true value of the parameter
• the values within the CI can be seen as alternative plausible estimates of the parameter of interest
• the smaller the sample size, the larger the CI
• the estimates of (TPF, FPF) and (PPV, NPV) are random variables — proportions of successes in Bernoulli trials:
  • Bernoulli trial: an experiment with a random binary outcome
  • binomial distribution: the discrete distribution of the number of successes in $n$ independent Bernoulli trials with success probability $p$:
    $X \sim B(n, p): \quad P[X = k] = \binom{n}{k} p^k (1-p)^{n-k}, \qquad E[X] = np, \qquad \mathrm{Var}[X] = np(1-p)$
  • as $n \to \infty$, $\dfrac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1)$
• simplest CI — normal approximation: a $1-\alpha$ CI for the binomial parameter $p$ (the proportion of successes in $n$ trials) is
  $p \pm z_{1-\alpha/2}\, \sqrt{\dfrac{p(1-p)}{n}}$
  where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution (e.g. for a 95% CI, $\alpha = 0.05$ and $z_{0.975} \approx 1.96$)

Warning: the normal approximation is poor for FPF or TPF close to 0 or 1.

Agresti–Coull $1-\alpha$ CI:
• let $n$ be the number of trials and $X$ the number of successes; define $\tilde n = n + z^2_{1-\alpha/2}$ and $\tilde p = \dfrac{1}{\tilde n}\left(X + \tfrac{1}{2} z^2_{1-\alpha/2}\right)$; then a good approximation of the CI is
  $\tilde p \pm z_{1-\alpha/2}\, \sqrt{\dfrac{\tilde p (1 - \tilde p)}{\tilde n}}$
• other formulas for CIs: Wilson score interval, Clopper–Pearson interval, Bayesian CIs (a small code sketch of the normal-approximation and Agresti–Coull intervals is given after this subsection)

Example: a test for predicting pCR in breast cancer yields

              pCR = 0    pCR = 1
predicted 0      61          5
predicted 1      24         10

TPF = 0.67, FPF = 0.28, PPV = 0.29, NPV = 0.92
95% confidence intervals:
• normal approx.: FPF ∈ (0.197, 0.391), TPF ∈ (0.428, 0.905)
• Wilson: FPF ∈ (0.208, 0.398), TPF ∈ (0.417, 0.848)
• Bayesian: FPF ∈ (0.205, 0.397), TPF ∈ (0.416, 0.860)

Joint confidence intervals
[figure: univariate confidence intervals for FPF and TPF and the induced rectangular region in the (FPF, TPF) plane]
• what is the joint $100(1-\alpha)\%$ confidence region for (FPF, TPF)?

Rectangular confidence regions
If $(P_{low}, P_{up})$ and $(Q_{low}, Q_{up})$ are the $1-\alpha^{*}$ univariate confidence intervals for two independent binomial random variables $P$ and $Q$, then the rectangle $R \equiv (P_{low}, P_{up}) \times (Q_{low}, Q_{up})$ is a $(1-\alpha)$ confidence region for $(P, Q)$, where $\alpha = 1 - (1-\alpha^{*})^{2}$.
Examples:
• 95% univariate CIs lead to a 90.25% confidence region
• for a 95% confidence region, 97.5% univariate CIs are needed
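A minimal sketch of the normal-approximation and Agresti–Coull intervals (Python with NumPy/SciPy assumed; an illustration, not the slides' own computation), applied to the TPF of the pCR example, plus the rectangular-region calculation:

```python
import numpy as np
from scipy.stats import norm

def normal_ci(successes, n, alpha=0.05):
    """Normal-approximation (Wald) CI for a binomial proportion."""
    z = norm.ppf(1 - alpha / 2)
    p = successes / n
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def agresti_coull_ci(successes, n, alpha=0.05):
    """Agresti-Coull CI: add z^2/2 pseudo-successes and z^2/2 pseudo-failures."""
    z = norm.ppf(1 - alpha / 2)
    n_t = n + z**2
    p_t = (successes + 0.5 * z**2) / n_t
    half = z * np.sqrt(p_t * (1 - p_t) / n_t)
    return p_t - half, p_t + half

# TPF in the pCR example: 10 true positives out of 15 diseased cases
print(normal_ci(10, 15))         # about (0.428, 0.905), matching the slide
print(agresti_coull_ci(10, 15))  # close to the Wilson interval quoted above

# joint rectangular region: two 95% univariate CIs give a (1 - alpha) region with
alpha_joint = 1 - (1 - 0.05) ** 2
print(alpha_joint)               # 0.0975 -> a 90.25% confidence region
```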
Performance parameters for continuous outputs

A motivating example
Using (FPF, TPF) for comparing tests:
[figure: the (FPF, TPF) plane partitioned into regions A–D around the operating point of the current test]
• a single-point performance measure partitions the space into 4 regions
• region A: better than the current test
• region D: worse than the current test
• regions B, C: less clear
Other issues with single-point performance metrics:
• difficulty in selecting the optimal threshold: different contexts may need different operating regimes
• additive batch effects may spoil the single-point performance

ROC curves: theory
[figure: score distributions S(X) of the negative and positive classes with threshold t, and the corresponding curve in the (FPF, TPF) plane]
• continuous test score $Y = S(x)$ (could be the margin $h(x)$)
• $\mathrm{FPF}(t) = P[Y \ge t \mid C=0]$
• $\mathrm{TPF}(t) = P[Y \ge t \mid C=1]$
• $\mathrm{ROC} = \{(\mathrm{FPF}(t), \mathrm{TPF}(t)) \mid t \in \mathbb{R}\}$
• $\lim_{t\to\infty} \mathrm{FPF}(t) = \lim_{t\to\infty} \mathrm{TPF}(t) = 0$
• $\lim_{t\to-\infty} \mathrm{FPF}(t) = \lim_{t\to-\infty} \mathrm{TPF}(t) = 1$

Properties of the ROC curve:
• it is a monotone increasing function
• the ROC curve is invariant to strictly increasing transformations of the scores, $Y = \psi(S(x))$
• parametric model: $\mathrm{ROC} = \{(\alpha, \mathrm{TPF}(\mathrm{FPF}^{-1}(\alpha))) \mid \alpha \in (0,1)\}$
• $\mathrm{ROC}(0) = 0$, $\mathrm{ROC}(1) = 1$, and
  $\dfrac{\partial\, \mathrm{ROC}(t)}{\partial t} = \dfrac{f_C(\mathrm{FPF}^{-1}(t))}{f_{\bar C}(\mathrm{FPF}^{-1}(t))}$,
  where $f_C$ and $f_{\bar C}$ are the probability densities of the scores within the diseased and healthy populations, respectively
• the ROC curve describes the relationship between the two score distributions, independently of the actual scale of the scores

Note that
$\dfrac{\partial\, \mathrm{ROC}(t)}{\partial t} = \dfrac{P[Y = t \mid C=1]}{P[Y = t \mid C=0]} = \mathrm{LR}(t)$,
the likelihood ratio at threshold $t$.
• if LR is monotonically increasing, then a classification rule of the form LR > t is optimal
• the ROC curve based on LR lies uniformly above all other ROC curves
• the optimal ROC curve is concave ⇒ its slope is a monotone decreasing function

Summary indices
Area under the ROC curve (AUC): $\mathrm{AUC} = \int_0^1 \mathrm{ROC}(t)\,dt$
[figure: the AUC as the area under the ROC curve]
Properties:
• $0.5 \le \mathrm{AUC} \le 1$
• $\mathrm{AUC} = P[Y_C > Y_{\bar C}]$ → the probability of correctly ordering a random pair of cases (Mann–Whitney–Wilcoxon U-statistic)
• $\mathrm{AUC} = \int_0^1 \mathrm{TPF}(\mathrm{FPF}^{-1}(t))\,dt = -\int_{-\infty}^{\infty} \mathrm{TPF}(t)\, d\,\mathrm{FPF}(t)$

The binormal model: $\mathrm{ROC}(t) = \Phi(\alpha + \beta\,\Phi^{-1}(t))$, where $\beta > 0$ and $\Phi$ is the standard normal CDF.
Properties:
• $\mathrm{AUC} = \Phi\!\left(\dfrac{\alpha}{\sqrt{1+\beta^{2}}}\right)$
• binormal assumption: there exists some monotone strictly increasing function $h(\cdot)$ which makes $Y_C$ and $Y_{\bar C}$ normally distributed
• if the ROC is binormal, $\mathrm{ROC}(t) = \Phi(\alpha + \beta\Phi^{-1}(t))$, then $h(s) = -\Phi^{-1}(\mathrm{FPF}(s))$ transforms the scores $Y_C$ and $Y_{\bar C}$ into normally distributed random variables
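A small numerical sketch of the binormal model (Python with NumPy/SciPy assumed; the parameter values are illustrative): it builds $\mathrm{ROC}(t) = \Phi(\alpha + \beta\Phi^{-1}(t))$ and checks the closed-form $\mathrm{AUC} = \Phi(\alpha/\sqrt{1+\beta^2})$ against numerical integration of the curve.

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 1.2, 0.8          # illustrative binormal parameters (beta > 0)

def binormal_roc(t):
    """ROC(t) = Phi(alpha + beta * Phi^{-1}(t)) for t in (0, 1)."""
    return norm.cdf(alpha + beta * norm.ppf(t))

# closed-form AUC for the binormal model
auc_closed = norm.cdf(alpha / np.sqrt(1 + beta**2))

# trapezoidal integration of the ROC curve over FPF in (0, 1)
t = np.linspace(1e-6, 1 - 1e-6, 20000)
y = binormal_roc(t)
auc_numeric = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t))

print(auc_closed, auc_numeric)   # the two values agree to several decimals
```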
Empirical estimates of the ROC
$\mathrm{ROC}_e(t) = \widehat{\mathrm{TPF}}(\widehat{\mathrm{FPF}}^{-1}(t))$, with
$\widehat{\mathrm{TPF}}(t) = \dfrac{1}{n_C}\sum_{i=1}^{n_C} I[Y_{Ci} \ge t], \qquad \widehat{\mathrm{FPF}}(t) = \dfrac{1}{n_{\bar C}}\sum_{i=1}^{n_{\bar C}} I[Y_{\bar C i} \ge t]$
[figure: empirical ROC curve (true positive rate vs. false positive rate); the "ROC" of a single-threshold classifier reduces to a single point]

Empirical estimate of the AUC
Mann–Whitney–Wilcoxon U-statistic:
$\mathrm{AUC}_e = \dfrac{1}{n_C n_{\bar C}} \sum_{i=1}^{n_C} \sum_{j=1}^{n_{\bar C}} \left( I[Y_{Ci} > Y_{\bar C j}] + 0.5\, I[Y_{Ci} = Y_{\bar C j}] \right)$
Note: if only one point in the (FPF, TPF) space is given, $\mathrm{AUC} = 0.5\,(1 + \mathrm{TPF} - \mathrm{FPF})$.

AUC: sampling variability
$\mathrm{Var}(\mathrm{AUC}_e) = \dfrac{1}{n_C n_{\bar C}} \left[ \mathrm{AUC}(1-\mathrm{AUC}) + (n_C - 1)(Q_1 - \mathrm{AUC}^2) + (n_{\bar C} - 1)(Q_2 - \mathrm{AUC}^2) \right]$
where
$Q_1 = P[Y_{Ci} \ge Y_{\bar C j},\; Y_{Ck} \ge Y_{\bar C j}], \qquad Q_2 = P[Y_{Ci} \ge Y_{\bar C j},\; Y_{Ci} \ge Y_{\bar C k}]$

Semi-parametric models
Start from $\mathrm{ROC}(t) = \mathrm{TPF}(\mathrm{FPF}^{-1}(t \mid \alpha) \mid \beta)$ and assume some parametric form for TPF and FPF, whose parameters are estimated from the data.
Example of a semi-parametric model:
$Y_{Ci} = \mu_C + \sigma_C\, \varepsilon_i, \qquad Y_{\bar C i} = \mu_{\bar C} + \sigma_{\bar C}\, \varepsilon_i$
where the $\varepsilon$ have mean 0 and variance 1 and follow some distribution, with survivor function $S$ estimated from the standardized residuals:
$S(t) = \dfrac{1}{n_C + n_{\bar C}} \left[ \sum_i I\!\left[\dfrac{Y_{Ci} - \mu_C}{\sigma_C} \ge t\right] + \sum_i I\!\left[\dfrac{Y_{\bar C i} - \mu_{\bar C}}{\sigma_{\bar C}} \ge t\right] \right]$
which leads to
$\mathrm{ROC}(t) = S\!\left( \dfrac{\mu_{\bar C} - \mu_C}{\sigma_C} + \dfrac{\sigma_{\bar C}}{\sigma_C}\, S^{-1}(t) \right)$

Ex: empirical vs. semi-parametric estimation
[figure: score densities for the non-diseased and diseased groups; semi-parametric vs. empirical ROC estimates]
$\mathrm{AUC}_e \approx 0.7475$; $\mathrm{AUC}_{sp} \approx 0.7418$

Performance estimation

Why estimation?
• finite training data
• no formula for CIs without distributional assumptions
• often, a single data set is available for both model building and performance measurement
• performance estimated on the modeling data is optimistically biased

Idea: split (maybe repeatedly) the available data into a training and a validation set, and assess the performance only on data that has not been used in building the model.

WARNING: all the processing steps that depend on the sampling and which lead to the final model MUST BE REPEATED IDENTICALLY ON EVERY TRAIN–VALIDATION SPLIT! This includes, but is not limited to: data normalization, feature selection, classifier training, meta-parameter optimization.

Notes:
• any two training sets generated from the full data set by resampling will usually overlap to some extent → the resulting models are not totally independent
• consequently, the variability is usually under-estimated
• the procedure is easy to parallelize, but attention must be paid to the parallel RNG (to avoid repeating the same random sequences)

Resampling methods
• simple split-sample approach
• k-fold cross-validation
• Monte-Carlo cross-validation
• repeated k-fold cross-validation
• leave-one-out
• bootstrapping
• ...

k-fold cross-validation
• separate train and test sets
• randomly divide the data into k subsets (folds) — you may also choose to enforce the class proportions in each fold (stratified CV)
• train on k − 1 folds and test on the held-out fold
• estimate the error as the average error measured on the held-out folds
[figure: the data split into k folds; in each round one fold is the test set and the remaining folds form the train set]
• usually k = 5 or k = 10
• if k = n ⇒ the leave-one-out estimator
• improved estimation: repeated k-fold CV (e.g. 100 × 5-CV)
• from the k folds we obtain k error estimates $\varepsilon_1, \ldots, \varepsilon_k$ on the test folds (or estimates of any other performance parameter)
• $\hat E_{k\text{-CV}} = \dfrac{1}{k} \sum_{j=1}^{k} \varepsilon_j$, together with an estimated standard deviation
• confidence intervals (simple version — normal approximation):
  $E \approx \hat E \pm \left( \dfrac{0.5}{n} + z \sqrt{\dfrac{\hat E (1 - \hat E)}{n}} \right)$
  where $n$ is the dataset size and $z = \Phi^{-1}(1 - \alpha/2)$ for a $1-\alpha$ confidence interval (e.g. $z = 1.96$ for a 95% confidence interval)
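A minimal sketch of stratified k-fold CV (assuming scikit-learn; the synthetic data, the scaler, the univariate feature selection and the logistic regression are placeholder choices, not the lecture's prescribed pipeline). The point it illustrates is the WARNING above: every sampling-dependent step is refit inside each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # normalization, feature selection and classifier are ALL refit on the training fold only
    model = make_pipeline(StandardScaler(),
                          SelectKBest(f_classif, k=10),
                          LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print(f"5-fold CV AUC: {np.mean(aucs):.3f} +/- {np.std(aucs, ddof=1):.3f}")
```

Repeating the loop over several random fold assignments (repeated k-fold CV) gives a more stable estimate of both the mean performance and its spread.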
Bootstrap error estimation
[figure: B bootstrap datasets $(X_1,Y_1), \ldots, (X_B,Y_B)$ resampled from $(X,Y)$, each yielding an error estimate $E_b$]
1. generate a new dataset $(X_b, Y_b)$ by resampling with replacement from the original dataset $(X, Y)$
2. train the classifier on $(X_b, Y_b)$ and test it on the left-out data (the samples not drawn), to obtain an error estimate $\hat E_b$
3. repeat steps 1–2 for $b = 1, \ldots, B$ and collect the $\hat E_b$

• estimate the error, for example with the .632 estimator:
  $\hat E = 0.368\, E_0 + 0.632\; \dfrac{1}{B} \sum_{b=1}^{B} \hat E_b$
  where $E_0$ is the error rate on the full training set $(X, Y)$
• use the empirical distribution of the $\hat E_b$ to obtain confidence intervals
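A minimal sketch of the bootstrap/.632 procedure (scikit-learn and NumPy assumed; the data and classifier are again placeholders): each bootstrap model is evaluated on its out-of-bag samples, and the .632 estimator blends the out-of-bag error with the resubstitution error $E_0$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=20, random_state=1)
rng = np.random.default_rng(1)
B, n = 200, len(y)

oob_errors = []
for b in range(B):
    boot = rng.integers(0, n, size=n)              # resample with replacement
    oob = np.setdiff1d(np.arange(n), boot)          # left-out (out-of-bag) samples
    if oob.size == 0:
        continue
    clf = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))

E0 = np.mean(LogisticRegression(max_iter=1000).fit(X, y).predict(X) != y)  # resubstitution error
E_632 = 0.368 * E0 + 0.632 * np.mean(oob_errors)

# percentile interval from the empirical distribution of the bootstrap errors
ci_low, ci_high = np.percentile(oob_errors, [2.5, 97.5])
print(f".632 error: {E_632:.3f}; 95% percentile interval of E_b: ({ci_low:.3f}, {ci_high:.3f})")
```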
LPO bootstrap
Classification rule: predict $C$ if $\hat h(x) \ge \theta$ and $\bar C$ otherwise, where $\hat h$ is the estimated log-likelihood ratio and $C$, $\bar C$ are the class labels.
The empirical AUC (conditioned on the training set) can be approximated by
$\mathrm{AUC} = \dfrac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} \psi\!\left( \hat h(x_i \mid C),\; \hat h(x_j \mid \bar C) \right)$
where $\psi$ is the Mann–Whitney kernel,
$\psi(a, b) = \begin{cases} 1 & a > b \\ \tfrac{1}{2} & a = b \\ 0 & a < b \end{cases}$
Yousef et al., Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier, PRL 2005

Estimation of the expected AUC by the LPO bootstrap:
$\overline{\mathrm{AUC}}_{LPO} = \dfrac{1}{n_1 n_2} \sum_{j=1}^{n_2} \sum_{i=1}^{n_1} \overline{\mathrm{AUC}}_{i,j}, \qquad \overline{\mathrm{AUC}}_{i,j} = \dfrac{\sum_{b=1}^{B} I^b_j\, I^b_i\, \psi(\hat h_b(x_i), \hat h_b(x_j))}{\sum_{b=1}^{B} I^b_j\, I^b_i}$
(here $I^b_i = 1$ when $x_i$ is not drawn into the $b$-th bootstrap training set, so each pair is scored only by classifiers that were not trained on it)

When two independent data sets are available, one can estimate:
• the expected value of the conditional AUC: the expectation over the population of training sets of the same size;
• the variability of the performance estimate due to the finite training set;
• the variability of the performance estimate due to the finite validation sets.
Yousef et al., Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach, PAMI 2006

Conclusions
What we do learn from CV (and related methods):
• the expected performance of the modeling recipe;
• the imprecision in estimating that performance;
• we can also have a look at:
  • which features are the most stable
  • which points are always misclassified
What we do not learn from CV:
• the best features
• the best classifier
• the best meta-parameters
We obtain these by training on the full dataset (no CV).

Performance comparison

General considerations:
• are we comparing methods/algorithms or models?
• let there be two models M1 and M2 and a performance parameter P
• what differences are relevant?
• proper planning of the experimental design is needed
• hypothesis testing (equivalence/difference and inferiority/superiority):
  H0: there is no difference in performance, P(M1) = P(M2)
  H1: P(M1) ≠ P(M2) (two-sided test), or H1: P(M1) < P(M2) (or >) (one-sided, i.e. inferiority/superiority test)
• informally, one can check the overlap between CIs
• ideally, one would have a very large test set for the comparison

In everyday applications...
• one has limited data → use resampling (e.g. cross-validation) for testing
• let $P_{11}, \ldots, P_{1K}$ be the performance of the 1st model on the K test sets and $P_{21}, \ldots, P_{2K}$ the performance of the 2nd model on the same K test sets
• simple tests: paired t-test and Wilcoxon signed-rank test
• warning: the variability is underestimated, hence the t-test has an inflated Type I error; there is a "corrected t-test" that alleviates the problem
• note that the two samples $\{P_{1\cdot}\}$ and $\{P_{2\cdot}\}$ are not independent!

McNemar's test
• consider a single test set of size m, on which both models are applied
• the following contingency table is constructed:

                        Model M2
                      0        1
Model M1      0     c00      c01
              1     c10      c11

with
• c00 counting how many times both models misclassified the same sample
• c11 counting how many times both models correctly classified the same sample
• c10 and c01 counting how many times M1 correctly classified a sample that M2 misclassified, and vice-versa
• McNemar's test: H0 — both classifiers have the same performance (same error rates)
• construct the test statistic $\chi^2_{Mc} = \dfrac{(|c_{01} - c_{10}| - 1)^2}{c_{01} + c_{10}}$
• $\chi^2_{Mc}$ has an approximate $\chi^2$ distribution with 1 degree of freedom
• $\chi^2_{Mc}$ is compared with the $\chi^2_{1,\,1-\alpha}$ critical value (confidence level $1-\alpha$)
• rule of thumb: the test needs a sample size large enough that $c_{01} + c_{10}$ is at least 30

Wrap-up
• there are many performance parameters; the choice depends on the intended usage
• performance estimation is a key step of the classifier building process
• pay attention to the proper application of resampling methods for performance estimation
• always (ALWAYS!) report the uncertainty in the estimates
• classifier performance comparison depends, again, on the intended application
• McNemar's test and CIs provide indications of performance differences

An example

MAQC-II:
• ∼300 participants, from 5 countries:
  • data providers
  • data analysis teams (DATs): 36 teams (∼100 people)
  • regulatory board (mainly FDA)
• 6 datasets, 13 endpoints
• > 30000 "models"
• each Data Analysis Plan (DAP) is peer-reviewed
• each DAT selects a single candidate model for each endpoint
• the MAQC-II consortium selects 2 models for each endpoint, before the release of the validation sets

DAP
Constraints:
• should be generally applicable, independent of dataset/endpoint
• trade-off: understandability/reproducibility vs. performance/complexity
• the models should make single-chip predictions
Solution:
• use MAS5 for normalization
• favor "simple" classifiers
• nested 10 × 5-CV
• use AUC as the main performance criterion
[figure: DAP workflow — quality control, MAS5 normalization, feature selection/ranking, classifier design, model optimization, CV-based performance estimation, and testing on external data]
• classifiers: DLDA, LDA, k-NN, CART, logistic regression
• meta-parameters: number of features, k, ...
• inner CV: optimize the meta-parameters
• outer CV: estimate the performance of the whole system (a sketch of this nested scheme is given at the end of the section)

Some performance results
[figure: AUC estimated by 10×5-CV and validation AUC, per endpoint A–M]
Estimation bias
[figure: CV minus validation AUC (deltaAUC), per endpoint and per organization]
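To close, a minimal sketch of the nested CV scheme described in the DAP (scikit-learn assumed; the synthetic data, the k-NN classifier and the small parameter grid are illustrative placeholders, not the MAQC-II models, and the fold counts do not reproduce the exact 10 × 5-CV of the DAP): the inner CV selects the meta-parameters, the outer CV estimates the performance of the whole recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=100, n_informative=10, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectKBest(f_classif)),
                 ("clf", KNeighborsClassifier())])

# inner CV: optimize the meta-parameters (number of features, k)
param_grid = {"select__k": [5, 10, 20], "clf__n_neighbors": [3, 5, 7]}
inner = GridSearchCV(pipe, param_grid, scoring="roc_auc",
                     cv=StratifiedKFold(5, shuffle=True, random_state=1))

# outer CV: estimate the performance of the whole model-building recipe
outer_auc = cross_val_score(inner, X, y, scoring="roc_auc",
                            cv=StratifiedKFold(10, shuffle=True, random_state=2))
print(f"nested-CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std(ddof=1):.3f}")
```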