PA196: Pattern Recognition
11. Additional topics

Dr. Vlad Popovici
popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1. Model selection
   • Bias-variance trade-off
   • Some methods for model selection
2. Learning curves

Bias-variance decomposition

[Figure: Hastie et al., Elements of Statistical Learning, fig. 7.1 - penalized regression fits for various values of the model complexity parameter]

• assume a training set Z = {(x_i, y_i) | i = 1, ..., n} has been drawn i.i.d. from a fixed underlying distribution P(X, Y)
• let L(y, h(x)) be the loss function (h is the classifier function)
• training error (apparent error): Err_0 = (1/n) Σ_{i=1}^{n} L(y_i, h(x_i))
• generalization error: the prediction error over a test set: Err_Z = E[L(Y, h(X)) | Z]
• expected prediction error (expected loss): Err = E[Err_Z] = E[L(Y, h(X))]

Some loss functions:
• 0-1 loss: L(y, h(x)) = 1_{y ≠ h(x)}
• squared loss: L(y, h(x)) = (y − h(x))^2
• in the context of MAP classification: for K groups/classes 1, ..., K, let p_k(X) = Pr(Y = k | X); the classifier outputs (a monotone transformation of) the estimates p̂_k. An adequate loss is then
  L(Y, p̂(X)) = −2 Σ_{k=1}^{K} 1_{Y=k} log p̂_k(X) = −2 log p̂_Y(X) = −2 × log-likelihood
  (the factor −2 makes this loss equivalent to the squared loss under Gaussian distributions)

Under a model Y = h(X) + ε, where ε is some noise with E[ε] = 0 and Var[ε] = σ², and using the squared loss, the error of a fitted predictor ĥ at a given point x_0 can be written as

  Err(x_0) = E[(Y − ĥ(x_0))² | X = x_0]
           = σ² + (E[ĥ(x_0)] − h(x_0))² + E[(ĥ(x_0) − E[ĥ(x_0)])²]
           = σ² + Bias² + Variance

• σ² is the irreducible error and cannot be influenced by the model
• the bias: the difference between the true value and the average prediction
• the variance: the expected squared deviation of the prediction from its mean
• too complex models: low bias, high variance
• too simple models: high bias, low variance

Model complexity
• in some cases it is easy to quantify the model complexity:
  • k-NN: 1/k is a measure of complexity
  • for a linear model h(x) = ⟨w, x⟩, the complexity is directly related to the number of non-zero coefficients
  • SVM: the VC dimension can be interpreted as a measure of complexity
• "Occam's razor" principle (lex parsimoniae): among competing hypotheses/explanations, the "simpler" one (the one with the fewest assumptions) should be preferred

Example:
[Figure: Hastie et al., Elements of Statistical Learning, fig. 7.2 - expected error, squared bias and variance as functions of model complexity]

Some methods for model selection
• idea: compute some fitness indicator for a series of candidate models and choose the "best" one
• for classifiers, the most commonly used fitness indicator is the classification performance → estimate the error rate (or AUC, or any other performance measure) for a series of values of the meta-parameter(s) and choose the value with the lowest error rate (highest AUC, etc.); e.g. the grid search we used for SVM - see the sketch below
• alternative: balance the model complexity against its fitness: AIC, BIC, MDL
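A minimal sketch of such a grid search, using scikit-learn: the synthetic data set, the parameter grid and the choice of 5-fold cross-validated accuracy as the fitness indicator are assumptions made for this illustration, not the exact setup used in the course labs.

```python
# Minimal sketch: choosing SVM meta-parameters (C, gamma) by grid search,
# using cross-validated accuracy as the fitness indicator.
# The synthetic data, the grid values and cv=5 are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],   # regularization strength
    "gamma": [0.01, 0.1, 1],  # RBF kernel width
}

# scoring="roc_auc" could be used instead if AUC is the preferred indicator
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best meta-parameters:", search.best_params_)
print("cross-validated accuracy of the best model:", search.best_score_)
```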
AIC - Akaike's Information Criterion

In general, when models are fitted by maximizing the log-likelihood,

  AIC = −(2/n) × log-likelihood + 2 d/n

where "log-likelihood" is the maximized log-likelihood (achieved by the fitted classifier) and d is a measure of the complexity (degrees of freedom) of the model. The best model (in the AIC sense) is the one that minimizes AIC.

AIC - for classifiers

  AIC(α) = Err_0(α) + 2 (d(α)/n) σ̂²

• α is some meta-parameter of the classifier (e.g. the polynomial degree for a polynomial-kernel SVM, or k in k-NN, etc.)
• d(α) is the corresponding complexity
• AIC(α) is an estimate of the test error curve
• the best model (in the AIC sense) is the one with the α that minimizes AIC(α)
• σ̂² can be estimated from the mean squared error of a low-bias model

Example
[Figure: Hastie et al., Elements of Statistical Learning, fig. 7.3]

BIC - Bayesian Information Criterion

In general, in log-likelihood-maximization settings,

  BIC = −2 × log-likelihood + d log n

which, for a classifier, can be written as

  BIC = (n/σ̂²) [ Err_0 + (log n) · (d/n) · σ̂² ]

• BIC penalizes complex models more heavily than AIC does
• the best model (in the BIC sense) is the one that minimizes BIC

MDL - Minimum Description Length
• MDL leads to a criterion formally identical to BIC, but it comes from a totally different theoretical framework
• the classifier is seen as an encoder of the message (the data)
• the model is the encoded message to be transmitted - hence we want it to be parsimonious (sparse) while losing little information

Learning curves
• a "diagnostic" for classifier training
• can be used to estimate/approximate the sample size needed for a given problem

[Figure: typical learning curves - training and test error plotted against the sample size n]

Example: Popovici et al., "Effect of training-sample size...", BCR 2010
• breast cancer gene expression data
• problems: prediction of ER status, of pCR, and of pCR within the ER-negative group
• the performance (AUC) is estimated for increasing sample sizes
• the following learning-curve model is fitted (Fukunaga): AUC(n) = a + b/n - see the fitting sketch below
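A minimal sketch of fitting AUC(n) = a + b/n by ordinary least squares; the sample sizes and AUC values below are invented for illustration and are not data from the cited paper.

```python
# Minimal sketch: fitting the learning-curve model AUC(n) = a + b/n.
# The (n, AUC) pairs are invented for illustration; in practice they would be
# cross-validated AUC estimates obtained on training subsamples of size n.
import numpy as np

n = np.array([25.0, 50.0, 100.0, 200.0, 400.0, 800.0])  # training-set sizes
auc = np.array([0.62, 0.68, 0.72, 0.74, 0.755, 0.76])   # estimated AUC at each size

# The model is linear in the regressors (1, 1/n), so ordinary least squares suffices.
A = np.column_stack([np.ones_like(n), 1.0 / n])
(a, b), *_ = np.linalg.lstsq(A, auc, rcond=None)

print(f"fitted model: AUC(n) = {a:.3f} + ({b:.2f})/n")
print(f"estimated asymptotic AUC (n -> infinity): {a:.3f}")
print(f"predicted AUC for n = 1600: {a + b / 1600:.3f}")
```

The fitted intercept a estimates the asymptotic performance for very large samples, while b captures how quickly the curve approaches it; this is what makes such a fit useful for judging whether collecting more samples is worthwhile.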