Gene expression-based classifiers

Vlad Popovici 1,2

1 Bioinformatics Core Facility, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland; e-mail: vlad.popovici@isb-sib.ch
2 Institute of Biostatistics and Analyses, Masaryk University, CZ-62500 Brno, Czech Republic

Abstract

Whole genome profiling and the decreasing costs of genome sequencing make it possible to measure the activity of tens of thousands of genes, measurements that can potentially be used to predict a patient's risk of relapse or response to a specific treatment. These predictions rely on mathematical models that combine the measurements from a selected set of genes into either a continuous score or a binary outcome. Building models that can be used in clinical practice, with real benefits for the patients, requires a rigorous methodological approach, and the purpose of this chapter is to briefly describe some theoretical considerations and practical results in the field of gene expression-based classifiers.

Key words: classifiers, biomarkers, performance estimation, model validation.

1. Introduction

Clinical practice has shown that many cancer treatments benefit only a small fraction of the patients who receive them. Lacking precise means of identifying the patients most likely to respond to a given treatment, many patients are prescribed ineffective treatments, which places a serious burden on them and on the health care systems. Personalized medicine addresses exactly this problem by trying to diagnose and treat a disease using information about the patient's genes, proteins and environment. At the core of the diagnostic and treatment decisions lie the classifiers: mathematical models that assemble all this information into a system producing binary or multi-valued decisions. In this context, new treatments are accompanied by diagnostic tests which are supposed to identify the most likely responders. Companion diagnostic tests are even more important in the case of targeted therapies, which are "drugs or other substances that block the growth and spread of cancer by interfering with specific molecules involved in tumor growth and progression" (NCI Fact Sheet, http://www.cancer.gov/). As these drugs target specific molecular processes, such as cell growth signaling, angiogenesis or apoptosis, or stimulate the immune response, highly specific tests are needed to identify the right patient population.

1.1. What is a classifier in the context of genomic data

Classifiers appear under various names in the literature on gene expression-based diagnostics and prognostics: "(multigene) expression signature", "(multigene) biomarker" or simply "risk predictor/score". In general, we speak of a classifier when we have in mind a model that produces a crisp decision (be it binary or multi-level). While a score can be converted into a decision (see further on in this chapter), so that "score" and "classifier" could be used interchangeably in some contexts, a gene expression signature is usually not enough to specify a classifier: it refers rather to the genes selected as being specific to some phenotype, and it normally does not specify how these genes should be combined to predict the phenotype in question. The term "biomarker" can also be misleading, since it may refer to markers that are mechanistically linked to disease activity.
In conclusion, we prefer the term "gene-based classifier", by which we mean a prediction model that combines the gene expression values (and possibly other variables); such a classifier may produce a score as an intermediate step towards the decision.

2. Classifiers

Without any loss of generality, we will consider in the following the case of binary classifiers built on continuous variables, and we will denote the two alternatives (called classes) by "-1" and "+1". Let $f: \mathbb{R}^p \to \mathbb{R}$ be a real-valued function which maps a vector $\mathbf{x} \in \mathbb{R}^p$ to a continuous score. A score $s = f(\mathbf{x})$ is converted into a binary class label by $h(s) = \mathrm{sign}(s)$. Some classifiers directly produce the binary label (e.g. the basic Top Scoring Pairs algorithm, described later in this chapter), while others first produce a score. As the labels are easily obtained from the scores, and since using continuous functions is more convenient for modeling, we will generally focus on finding the function $f$ rather than $h$.

We can then state the problem of learning a classification rule as the task of finding a real-valued function $f \in \mathcal{F}$ that maps each point of the input space (here considered to be $\mathbb{R}^p$) to a score that, after thresholding, will produce a binary label which will not differ in too many cases from the true label. This formulation is too vague to be of any practical utility as long as we do not specify
• which is the function space $\mathcal{F}$ in which we search for the solution;
• what we mean by 'differ', and
• how many misclassified cases is 'too many' for a classifier to be considered good.

The choice of the function space $\mathcal{F}$ is the first decision a data modeler has to take and, in most cases, it means defining a parametric form for the score function $f$. Let the parameters on which $f$ depends be collected in a $d$-dimensional parameter vector $\theta \in \Omega \subseteq \mathbb{R}^d$. The problem of training a classifier becomes an optimization problem in which one has to find the optimal vector $\theta^*$ such that the expected risk of misclassification (expected prediction error) is minimized:
$$\theta^* = \arg\min_{\theta \in \Omega} \int L\big(y, f(\mathbf{x}; \theta)\big)\, dP(\mathbf{x}, y),$$
where $L$ is a loss function penalizing the discrepancies between the predicted label $f(\mathbf{x}; \theta)$ and the true label $y$. The integral is taken with respect to the probability distribution $P$ which generates the population $\{(\mathbf{x}, y) \mid \mathbf{x} \in \mathbb{R}^p, y = \pm 1\}$. As this probability distribution is usually not available, the risk is estimated from a finite training set (sample) given as a pair of sets $X_n = \{\mathbf{x}_i, i = 1, \dots, n\} \subset \mathbb{R}^p$ and $Y_n = \{y_i = \pm 1, i = 1, \dots, n\}$ of points drawn independently and identically distributed from the underlying probability distribution. In this case, the prediction error can only be estimated from the finite sample, so the estimate becomes dependent on the particular training set. This observation justifies the introduction of various error estimation techniques, some of which are briefly described in this chapter.

The loss function is of central importance in defining the form of the classifier, and several ways of penalizing the errors have been proposed in the literature (Hastie et al, 2009; Duda et al, 2001). Here we will consider only the case of the squared error loss,
$$L\big(y, f(\mathbf{x})\big) = \tfrac{1}{2}\big(y - f(\mathbf{x})\big)^2,$$
which is, by far, the most commonly used. When the risk of misclassification is estimated from a finite sample, we talk about the empirical risk (of misclassification) and we estimate it by its mean value over the given sample,
$$\frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(\mathbf{x}_i)\big),$$
which, for the squared error loss, is simply (up to the factor 1/2) the usual mean squared error, $\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i)\big)^2$.
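To make the preceding formulation concrete, the following minimal sketch (Python/NumPy, on simulated data; all variable names and values are hypothetical) fits a linear score $f(\mathbf{x}; \theta) = \mathbf{w}^T\mathbf{x} + b$ by least squares, i.e. by minimizing the empirical squared-error risk over the training sample, and then thresholds the score with the sign function to obtain class labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training set: n = 40 samples, p = 5 variables, labels in {-1, +1}.
n, p = 40, 5
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))

def score(X, theta):
    """Linear score f(x; theta) = w.x + b, with theta = (w, b)."""
    w, b = theta
    return X @ w + b

def empirical_risk(theta, X, y):
    """Empirical risk under the squared-error loss: (1/n) * sum_i (y_i - f(x_i))^2."""
    return np.mean((y - score(X, theta)) ** 2)

def predict(X, theta):
    """Class label h(x) = sign(f(x))."""
    return np.sign(score(X, theta))

# Least-squares fit of (w, b): one way of minimizing the empirical risk above.
Xb = np.column_stack([X, np.ones(n)])              # append a column for the intercept
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
theta_hat = (coef[:-1], coef[-1])

print("empirical risk:", empirical_risk(theta_hat, X, y))
print("training error rate:", np.mean(predict(X, theta_hat) != y))
```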
Figure 1 depicts a possible scenario for a binary classification problem in the case of $p = 2$, with the solid line representing the classification boundary (i.e. the separation between the two classes), defined by the equation $f(\mathbf{x}) = 0$. Let us analyze the three proposed solutions: in the first panel, the classifier perfectly separates the two classes, while the other two solutions are simpler (smoother) classification functions, which misclassify some points. In practice, it turns out that a function that perfectly separates the training set will usually perform poorly on unseen data, i.e. the predictions will have high variance across different samples drawn from the same underlying distribution $P$ as the training data. We say in this case that the first function overfits the training set. On the other hand, a too simplistic explanation such as the one given by the second classifier will never be able to satisfactorily fit the training data, i.e. the chosen model has a large bias; in this case we say that the model underfits the training set. The central problem of machine learning is to find the right trade-off between underfitting and overfitting, which will generate functions able to generalize well, i.e. whose performance remains good on unseen data. This problem is also known as the bias-variance dilemma in classical statistics. We will detail further on what we mean by good performance of a classifier.

Figure 1. Three possible scenarios for a trained classifier: different degrees of regularization lead to different solutions, with various performances.

To conclude this introductory section, we note that the genomic applications of classifiers face a specific problem: fitting models in very high dimensional spaces (in general, $p \gg n$). Because of the high number of degrees of freedom of the learning problem, one can always find a classifier that perfectly fits the training set, for any possible labeling. To put it in other terms, the higher the dimensionality of the space, the slower the convergence of the estimators (of the parameters), a phenomenon called the curse of dimensionality. It follows that one has either to use a very large learning set or to constrain the form of the classifier such that it fits only the most salient characteristics of the two classes. The first solution is not practically possible, so only the second approach remains feasible. Luckily, the variables (genes) are not all independent, and a large proportion of them are usually not important for the classification task. This explains why, despite the unfavorable setting, proper classifiers with reasonable performance can be found for many applications.

2.1. Bayesian decision theory

Let us consider for a moment the best case scenario, in which the classification problem is completely specified by the following probability functions:
• $P(y = +1)$ and $P(y = -1)$ are called prior probabilities (priors) and give the probability of either of the classes when no other information is available. In general, for a binary classification problem, one excludes the possibility of observing any other class but one of the two ($y \in \{\pm 1\}$), so $P(y = +1) = 1 - P(y = -1)$.
• the class-conditional density functions, $p(\mathbf{x} \mid y = +1)$ and $p(\mathbf{x} \mid y = -1)$, for $\mathbf{x} \in \mathbb{R}^p$, i.e. the probability density function of $\mathbf{x}$ given that its label is "+1" (or "-1").

Figure 2. Class-conditional density functions for ESR1 gene expression as measured by one probeset: p(205225_at | y = "ER+") and p(205225_at | y = "ER-"). The two classes are estrogen receptor-positive (ER+, red line) and estrogen receptor-negative (ER-, blue line).
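As an illustration of these two ingredients, here is a small sketch (Python, simulated data; the expression values and sample sizes are hypothetical and only loosely mimic the situation of Figure 2) that estimates the priors from the class frequencies and the class-conditional densities of a single gene by kernel density estimation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Simulated log2 expression of a single probeset in ER+ and ER- samples
# (hypothetical values standing in for 205225_at).
expr_pos = rng.normal(loc=10.5, scale=1.0, size=120)   # class y = +1 (ER+)
expr_neg = rng.normal(loc=7.0, scale=1.2, size=60)     # class y = -1 (ER-)

# Priors P(y = +1) and P(y = -1), estimated from the class frequencies.
prior_pos = len(expr_pos) / (len(expr_pos) + len(expr_neg))
prior_neg = 1.0 - prior_pos

# Nonparametric (kernel) estimates of the class-conditional densities p(x | y).
dens_pos = gaussian_kde(expr_pos)
dens_neg = gaussian_kde(expr_neg)

# Evaluate both densities at an observed expression value x0.
x0 = 9.0
print("p(x0 | ER+) =", dens_pos(x0)[0], " p(x0 | ER-) =", dens_neg(x0)[0])
print("priors:", prior_pos, prior_neg)
```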
Using Bayes' rule, it is easy to obtain the posterior probability
$$P(y = \pm 1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = \pm 1)\, P(y = \pm 1)}{p(\mathbf{x})}.$$
This shows that by observing the vector $\mathbf{x}$ (called evidence) and using information about the priors and the class-conditional densities (called likelihood), one can obtain the posterior probability that the observed instance belongs to one of the classes. It follows naturally that, in order to minimize the risk of misclassification, one must assign $\mathbf{x}$ to the class with maximum posterior probability (Bayes decision rule):
$$f(\mathbf{x}) = \begin{cases} -1, & P(y = -1 \mid \mathbf{x}) > P(y = +1 \mid \mathbf{x}) \\ +1, & P(y = -1 \mid \mathbf{x}) \le P(y = +1 \mid \mathbf{x}) \end{cases}$$
It is sometimes convenient to express this rule in terms of a log-ratio: assign $\mathbf{x}$ to class "+1" if
$$\log \frac{P(y = +1 \mid \mathbf{x})}{P(y = -1 \mid \mathbf{x})} \ge 0,$$
and to class "-1" otherwise. The Bayesian decision is optimal in the sense that it minimizes the probability of error, but it requires full information about the priors and the class-conditional densities. This is not the case in real applications, and a plethora of approaches have been proposed to deal with more realistic scenarios. One can, for example, consider a parametric model for the probabilities (e.g. linear discriminant analysis, naïve Bayes classifier, etc.) or use nonparametric estimators of the densities.

2.2. Linear discriminants

Suppose that the class-conditional densities are multivariate Gaussians,
$$p(\mathbf{x} \mid y = \pm 1) = \frac{1}{(2\pi)^{p/2}\, |\Sigma_{\pm 1}|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_{\pm 1})^T \Sigma_{\pm 1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{\pm 1})\right),$$
where $\boldsymbol{\mu}_{\pm 1}$ are the mean vectors and $\Sigma_{\pm 1}$ the covariance matrices of the two respective classes ($|\cdot|$ is the determinant operator). If the two classes have equal covariance matrices, $\Sigma_{-1} = \Sigma_{+1} = \Sigma$, the log-ratio of the posteriors becomes
$$\log \frac{P(y = +1 \mid \mathbf{x})}{P(y = -1 \mid \mathbf{x})} = \log \frac{P(y = +1)}{P(y = -1)} - \tfrac{1}{2}(\boldsymbol{\mu}_{+1} + \boldsymbol{\mu}_{-1})^T \Sigma^{-1} (\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_{-1}) + \mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_{+1} - \boldsymbol{\mu}_{-1}).$$
The values of the class means and of the covariance matrix have to be estimated from the training set, using the usual estimators. The priors are estimated by the class frequencies, $\hat{P}(y = +1) = n_{+1}/n$ and $\hat{P}(y = -1) = n_{-1}/n$, where $n_{\pm 1}$ are the numbers of elements in each class. The above equation shows that, for equal covariance matrices, the decision boundary between the classes is a linear function of $\mathbf{x}$. By introducing the discriminant functions
$$\delta_{\pm 1}(\mathbf{x}) = \mathbf{x}^T \Sigma^{-1} \boldsymbol{\mu}_{\pm 1} - \tfrac{1}{2} \boldsymbol{\mu}_{\pm 1}^T \Sigma^{-1} \boldsymbol{\mu}_{\pm 1} + \log P(y = \pm 1),$$
the decision rule $y = \arg\max_{k = \pm 1} \delta_k(\mathbf{x})$ is equivalent to comparing the log-ratio of the posteriors with 0. As a final remark, we note that $d^2(\mathbf{x}, \boldsymbol{\mu}) = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is the (squared) Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$ (which becomes the Euclidean distance if the covariance matrix is the identity matrix) and that, under the equal covariance assumption, the decision rule assigns $\mathbf{x}$ to the class whose centroid ($\boldsymbol{\mu}$) is closest in the sense of this metric. The squared error loss function, mentioned in the introduction of this chapter, is intimately linked to the LDA classifier, which can also be derived from a linear regression model in which the class label is treated as a continuous response variable.
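The LDA rule can be implemented directly from these formulas. The sketch below (Python/NumPy, simulated data; the function names and the toy data are hypothetical) estimates the class means, the pooled covariance matrix and the priors, and classifies by comparing the two discriminant functions $\delta_{\pm 1}(\mathbf{x})$.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate the ingredients of the LDA discriminants from a training set
    with labels in {-1, +1}: class means, pooled covariance and priors."""
    Xp, Xn = X[y == 1], X[y == -1]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    n_p, n_n = len(Xp), len(Xn)
    # Pooled (common) covariance estimate.
    Sigma = ((n_p - 1) * np.cov(Xp, rowvar=False) +
             (n_n - 1) * np.cov(Xn, rowvar=False)) / (n_p + n_n - 2)
    return {"mu": {+1: mu_p, -1: mu_n},
            "Sigma_inv": np.linalg.inv(Sigma),
            "priors": {+1: n_p / (n_p + n_n), -1: n_n / (n_p + n_n)}}

def lda_predict(model, X):
    """delta_k(x) = x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log P(y = k); pick the larger."""
    Si = model["Sigma_inv"]
    delta = {}
    for k in (+1, -1):
        mu = model["mu"][k]
        delta[k] = X @ Si @ mu - 0.5 * mu @ Si @ mu + np.log(model["priors"][k])
    return np.where(delta[+1] >= delta[-1], 1, -1)

# Toy usage on simulated data (hypothetical).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 4)), rng.normal(1.5, 1.0, size=(30, 4))])
y = np.array([-1] * 30 + [1] * 30)
model = lda_fit(X, y)
print("training error rate:", np.mean(lda_predict(model, X) != y))
```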
2.3. Nearest neighbor and related classifiers

The intuition behind the nearest neighbor and related methods is that an observation should be assigned to the class containing other, similar observations. The nearest neighbor classifiers employ a voting scheme for deciding the class membership of a sample $\mathbf{x} \in \mathbb{R}^p$. The predicted (estimated) label is
$$\hat{y} = \mathrm{sign}\left(\sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i\right),$$
where $N_k(\mathbf{x})$ is the neighborhood of the $k$ closest points to $\mathbf{x}$. In other words, the predicted label $\hat{y}$ is the most common label among the $k$ points in the neighborhood. If $\hat{y} = 0$, the point lies on the decision boundary, having an equal number of neighbors from each class.

The notion of neighborhood implies the existence of a metric which, in its turn, is closely related to the notion of similarity, in the sense that more similar observations are closer to each other than less similar ones. In the case of $\mathbb{R}^p$, the natural metric is the Euclidean distance, but this is not necessarily the best choice. For genomic applications, in which the observations may be corrupted by high levels of noise, one may consider alternative distances, for example
• the correlation distance: $d_{\mathrm{cor}}(\mathbf{x}, \mathbf{z}) = 1 - \rho(\mathbf{x}, \mathbf{z})$;
• the cosine distance: $d_{\mathrm{cos}}(\mathbf{x}, \mathbf{z}) = 1 - \frac{\langle \mathbf{x}, \mathbf{z} \rangle}{\|\mathbf{x}\|\, \|\mathbf{z}\|}$,
where $\langle \cdot, \cdot \rangle$ denotes the scalar product of two vectors and $\|\cdot\|$ the $\ell_2$ norm, respectively.

The parameter $k \ge 1$ has to be optimized for each problem, usually by cross-validation (see later in this chapter). Smaller values will lead to a better fit of the training set, but may have an adverse effect on the generalization properties of the classifier. There is also a direct link between the parameter $k$ and the smoothness of the decision boundary.

Instead of considering all the points of the data set in the decision rule, one may choose to select only a few "representative" patterns from each class and to compute the distances only to these points in order to classify a new observation. This is similar to the LDA decision where, as we have seen above, one computes the Mahalanobis distance to the centers of the two classes and uses this information to classify the new observation. However, many other strategies for choosing the "centers" of the classes (like averaging all class members, or taking their median, for example) and for measuring the distances to these centers can be employed, each leading to a slightly modified version of the algorithm. This class of nearest centroid classifiers is commonly employed in genomic applications because, while not necessarily the best in terms of performance, it generally leads to simple classification rules that are readily interpretable and perform reasonably well.

2.4. Top scoring pairs

Top scoring pairs (TSPs) (Geman et al, 2004) are simple two-gene binary classifiers, in which the prediction of the class label is based solely on the relative ranking of the expression levels of the two genes. The rank-based approach to classification ensures a higher degree of robustness to technical variation and makes the rule easily portable across platforms. Also, the direct comparison of the expression levels of two genes is easily interpretable in the clinical context, making the TSPs attractive for medical tests. Let again $\mathbf{x} = [x_i]_{i=1,\dots,p} \in \mathbb{R}^p$ be a vector of measurements (e.g. gene expression) representing a sample and let the corresponding class label be $y = \pm 1$. Then, for all pairs of variables $i$ and $j$, a score is computed,
$$\Delta_{ij} = P(x_i < x_j \mid y = +1) - P(x_i < x_j \mid y = -1), \quad 1 \le i, j \le p,$$
where $P(\cdot \mid \cdot)$ are conditional probabilities, and the corresponding decision rule is: if $x_i < x_j$ then predict $y = +1$, otherwise $y = -1$. The pair with the highest score, or the top $k$ pairs, are then retained for the final model (Geman et al, 2004; Tan et al, 2005). Remarkably, this method does not require the optimization of any parameter and does not depend on any threshold. Figure 3 shows an example of a TSP predicting the estrogen receptor status; the decision boundary (in grey) is always a line with a slope of 1.

Figure 3. Predicting estrogen receptor status: if GSTP1 < ESR1, then the sample is considered ER+ (red dots), otherwise ER- (blue dots).
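The following sketch (Python/NumPy, simulated data; the planted informative pair and all values are hypothetical) computes the TSP scores $\Delta_{ij}$ for every ordered pair of variables and then classifies with the top-scoring pair, as described above.

```python
import numpy as np

def tsp_scores(X, y):
    """Delta_ij = P(x_i < x_j | y = +1) - P(x_i < x_j | y = -1),
    estimated by the class-wise frequencies of the event x_i < x_j."""
    pos, neg = X[y == 1], X[y == -1]
    p = X.shape[1]
    scores = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            if i != j:
                scores[i, j] = (np.mean(pos[:, i] < pos[:, j]) -
                                np.mean(neg[:, i] < neg[:, j]))
    return scores

def tsp_predict(X, i, j):
    """Rank-based rule of the pair (i, j): if x_i < x_j predict +1, otherwise -1."""
    return np.where(X[:, i] < X[:, j], 1, -1)

# Toy usage: plant a perfectly informative pair (genes 2 and 7) and recover it.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
y = np.where(X[:, 2] < X[:, 7], 1, -1)
S = tsp_scores(X, y)
i, j = np.unravel_index(np.argmax(S), S.shape)
print("top pair:", (i, j), "score:", S[i, j])
print("training error rate:", np.mean(tsp_predict(X, i, j) != y))
```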
3. Performance parameters and performance estimation

In the context of clinical applications, a classifier is seen as a test and its continuous value $f(\mathbf{x})$ is called a score. This score is discretized into two (binary tests) or more categories by using a number of thresholds (or cut-offs), and a prediction about the patient is made based on the predicted category. For example, a test can be used to predict whether a patient has a given disease (binary test), or to which of a number of risk groups he or she belongs (e.g. low, medium and high risk groups; categorical tests). By convention, we will say that an individual who is predicted to have the disease is positive for the test. Several categories of medical tests are common:
• diagnostic tests are designed to detect the 'diseased' condition in a patient;
• prognostic tests try to predict an outcome of interest, like 'recurrence' vs. 'no recurrence';
• predictive tests are used to detect which patients may or may not respond to a treatment; and
• screening tests are usually applied to a large population of normally healthy individuals in which the disease has low prevalence, and are usually followed by other, confirmatory tests.

Each of these tests is designed to work in specific settings. For example, we require a screening test to detect all (or, say, 99% of) the diseased cases (it must be sensitive), even if it produces a relatively high rate of false alarms (false positives). In contrast, a diagnostic test must be sensitive while also keeping the false positive rate low. On the other hand, as we have seen from Bayes decision theory, the prior distributions influence the final decision. A screening test is used in a population where the positive cases have a low prevalence: for example, in 2009 breast cancer in the UK had an age-standardized incidence of 124.4 cases per 100 000 women, so we could set a prior $P(y = \text{diseased}) \approx 0.0012$. A diagnostic test which is applied to confirm a screening test will work on a population with a much higher incidence of the disease, so a possible prior would be $P(y = \text{diseased}) = 0.75$. The screening and diagnostic tests could be one and the same, the only difference being the value of the threshold for the score above which we call a patient diseased; this threshold is optimized based on the prior probabilities. In the following, we briefly review some of the performance parameters used for characterizing classifiers. For a comprehensive treatment of the subject in the context of clinical applications, see (Pepe, 2003).

3.1. Threshold-dependent performance parameters

We will use the following conventions for the class labels:
• the true label (disease status) is denoted by $D$: $D = -1$ if non-diseased, $D = 1$ if diseased;
• the predicted label is denoted by $Y$: $Y = -1$ if negative for the test, $Y = 1$ if positive for the test.
A continuous score $f(\mathbf{x})$ is converted into a prediction by $\mathrm{sign}(f(\mathbf{x}) - t)$, where $t$ is a threshold.
For a given observation $\mathbf{x}$, one of the following four situations may arise:
• $D = 1$, $Y = 1$: true positive; the prediction and the true label both indicate a diseased case;
• $D = -1$, $Y = -1$: true negative; the prediction and the true label both indicate a non-diseased (healthy) case;
• $D = 1$, $Y = -1$: false negative; the test fails to detect the disease;
• $D = -1$, $Y = 1$: false positive; the test predicts a healthy case as diseased.
In assessing the performance of a classifier/test, one is interested in estimating the probabilities of each of the above four events. The estimation is based on the respective frequencies in a test set. One usually constructs a confusion matrix containing the counts of the observed occurrences (Table 1), from which the probabilities are estimated.

Table 1. The confusion matrix and the associated probabilities.

Predicted \ True (gold standard)   D = -1                                D = 1                                 Marginal
Y = -1                             true negatives, P(Y = -1 | D = -1)    false negatives, P(Y = -1 | D = 1)    P(Y = -1)
Y = 1                              false positives, P(Y = 1 | D = -1)    true positives, P(Y = 1 | D = 1)      P(Y = 1)
Marginal (priors)                  P(D = -1)                             P(D = 1) (prevalence)

The following performance parameters are among the most commonly used criteria for judging a diagnostic test:
• disease-centric measures quantify the performance in predicting the disease. The true and false positive fractions (TPF, FPF) are
$$\mathrm{TPF} = P(Y = 1 \mid D = 1), \qquad \mathrm{FPF} = P(Y = 1 \mid D = -1).$$
Both are needed to characterize the test, and both depend on the chosen threshold. If one knows the disease prevalence ($P(D = 1)$), then the probability of error can be estimated by
$$P(Y \ne D) = P(D = 1)\,(1 - \mathrm{TPF}) + \big(1 - P(D = 1)\big)\,\mathrm{FPF}.$$
A perfect test will have TPF = 1 and FPF = 0. The TPF is also called sensitivity, while 1 - FPF is called specificity; in the clinical testing literature, the latter two terms are the more common ones.
• predicted values are used to quantify the clinical value of a test (the likelihood of disease when the test is positive). The positive and negative predicted values (PPV, NPV) are defined as
$$\mathrm{PPV} = P(D = 1 \mid Y = 1), \qquad \mathrm{NPV} = P(D = -1 \mid Y = -1).$$
A perfect test will have PPV = NPV = 1, while for a totally uninformative one, PPV = $P(D = 1)$ and NPV = $P(D = -1) = 1 - P(D = 1)$.
There is a simple connection between the two groups of measures, and its derivation is left as an exercise to the reader.

Since the estimators of the above measures are proportions from Bernoulli trials, one can compute confidence intervals (CI) using any of the proposed methods (e.g. normal approximation, Wilson score, Agresti-Coull, and others (Newcombe, 1998)). Whatever method is used, the confidence intervals (usually 95% CIs) must be reported for a full characterization of the test. As a final remark, we note that the CIs obtained from the binomial distribution refer to each of the measures individually and do not provide a confidence region for the joint distribution of the pairs (TPF, FPF) or (PPV, NPV). To obtain such a confidence region, one can use the following result:

Proposition. If $[P_{\mathrm{low}}, P_{\mathrm{up}}]$ and $[Q_{\mathrm{low}}, Q_{\mathrm{up}}]$ are $1 - \alpha^*$ univariate confidence intervals for two binomial random variables $P$ and $Q$, then the rectangle $[P_{\mathrm{low}}, P_{\mathrm{up}}] \times [Q_{\mathrm{low}}, Q_{\mathrm{up}}]$ is a $1 - \alpha$ confidence region for $(P, Q)$, where $\alpha = 1 - (1 - \alpha^*)^2$.

For example, from two 95% univariate confidence intervals one can construct a 90.25% confidence region for the joint variable.
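For illustration, the sketch below (Python, with hypothetical confusion-matrix counts) computes sensitivity, specificity, PPV and NPV, attaches Wilson score confidence intervals, and applies the proposition above to build a rectangular joint confidence region for (sensitivity, specificity): two univariate intervals at level $\sqrt{0.95} \approx 0.9747$ yield a 95% region.

```python
import numpy as np
from scipy.stats import norm

def wilson_ci(k, n, conf=0.95):
    """Wilson score confidence interval for a binomial proportion k/n."""
    z = norm.ppf(1 - (1 - conf) / 2)
    phat = k / n
    denom = 1 + z**2 / n
    center = (phat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: true positives, false negatives, false positives, true negatives.
tp, fn, fp, tn = 46, 4, 12, 88

sens = tp / (tp + fn)   # TPF = P(Y = 1 | D = 1)
spec = tn / (tn + fp)   # 1 - FPF = P(Y = -1 | D = -1)
ppv = tp / (tp + fp)    # P(D = 1 | Y = 1)
npv = tn / (tn + fn)    # P(D = -1 | Y = -1)

print("sensitivity %.3f, 95%% CI %s" % (sens, wilson_ci(tp, tp + fn)))
print("specificity %.3f, 95%% CI %s" % (spec, wilson_ci(tn, tn + fp)))
print("PPV %.3f, NPV %.3f" % (ppv, npv))

# Joint 95% confidence region: each univariate CI taken at level sqrt(0.95).
lvl = np.sqrt(0.95)
print("95%% region:", wilson_ci(tp, tp + fn, lvl), "x", wilson_ci(tn, tn + fp, lvl))
```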
3.2. Threshold-independent performance parameters

We have already noted that the performance measures described in the previous section depend on the chosen value of the threshold $t$, and therefore we call them point estimates. However, a test (classifier) may need to work in different contexts, where one may want to select a different operating regimen (trade-off between sensitivity and specificity, or between PPV and NPV). Moreover, when comparing two tests with different operating regimens, it is difficult to draw any conclusion. It is clear that we need a characterization of the test which is independent of the threshold. The receiver operating characteristic (ROC) curve serves exactly this purpose.

Figure 4. Varying the threshold t above which a score f(x) leads to a positive test generates a ROC curve in the (FPF, TPF) space.

By letting the TPF and FPF vary with the threshold,
$$\mathrm{TPF}(t) = P\big(f(\mathbf{x}) \ge t \mid D = 1\big), \qquad \mathrm{FPF}(t) = P\big(f(\mathbf{x}) \ge t \mid D = -1\big),$$
we obtain the definition of the ROC curve:
$$\mathrm{ROC} = \big\{ \big(\mathrm{FPF}(t), \mathrm{TPF}(t)\big) \mid \forall t \in \mathbb{R} \big\}.$$
It is easy to see that the ROC function is monotone increasing and that it is invariant to strictly increasing transformations of the scores. The parametric form of the curve is given by
$$\mathrm{ROC} = \big\{ \big(\alpha, \mathrm{TPF}(\mathrm{FPF}^{-1}(\alpha))\big) \mid \forall \alpha \in (0, 1) \big\}.$$
A summary of the ROC curve is obtained by taking the area under the curve (AUC):
$$\mathrm{AUC} = \int_0^1 \mathrm{ROC}(\alpha)\, d\alpha.$$
The AUC is lower-bounded by 0.5 (corresponding to a totally uninformative test) and upper-bounded by 1. It can also be seen as the Mann-Whitney-Wilcoxon U-statistic, $\mathrm{AUC} = P(s_{D=1} > s_{D=-1})$, i.e. the probability of correctly ordering a random pair of cases (one diseased, one non-diseased) by their scores.

3.3. Performance estimation

Once a classifier is trained, one has to estimate its performance on unseen data. Lacking access to the full data collection on which the classifier will be applied, one has to rely on statistical estimates of the performance. The easiest estimate would be the one obtained by applying the classifier to the same data used for training it (the plug-in estimate). Except for a few rare cases, this estimate is optimistically biased, i.e. it will underestimate the error rate. Furthermore, relying on the plug-in estimate will more often than not lead to overfitting the training set: one will find parameters such that the classifier has minimum error rate on the training set, but performs poorly on new data. The moral is that the estimation of performance has to be done on an independent data set, completely different from the one used for building the classifier. One option is to randomly split the available data into two disjoint subsets, one used for building the model and one for estimating its performance (split-sample or holdout validation). While appealing, this method has at least two drawbacks: it does not use the available data in an optimal way, and the training set is drastically reduced in comparison with the original sample size. In any case, the true validation of a classifier or diagnostic test remains its long-run application to unseen data. In order to make better use of the training data, several resampling methods have been proposed, among which: k-fold cross-validation, Monte Carlo cross-validation, leave-one-out cross-validation, bootstrapping, etc. They all share the idea of repeatedly and randomly partitioning the available data into a training set and a validation set.
The training set thus obtained is used for the full model construction (including feature selection, meta-parameter optimization, model selection, etc.), while the validation set is used for obtaining intermediate estimates of the performance. At the end of the procedure, the intermediate estimates are aggregated into a final value (usually by averaging, but more sophisticated estimators can be used; see for example the .632 estimator below), and a measure of the variability of the estimate is also computed (variance, standard error, confidence intervals). These methods differ in the strategy used for partitioning the data. We briefly describe some of them here; for others the reader is referred to (Duda et al, 2001; Hastie et al, 2009).

• k-fold cross-validation splits the data into k partitions (folds) and uses each of them in turn as the validation set. Typical values for k are 5 and 10, and the choice usually represents a trade-off between a reasonable training set size and the computational burden, as the procedure is repeated k times. Note that any two models built in this setting share k-2 folds as training data. This means that the predictions are not totally independent, so the variance of the estimates is usually underestimated. An improved performance estimate is obtained by repeating the k-fold cross-validation on randomly shuffled versions of the original set (so-called repeated k-fold cross-validation). The final estimate of the performance (e.g. error rate, sensitivity, specificity, etc.) is the average of the intermediate estimates.

Figure 5. A 3-fold cross-validation scheme: each of the folds is used once and only once as validation set (the red block).

• leave-one-out cross-validation is the extreme case of k-fold cross-validation with k = n, where n is the sample size: the training/testing steps are repeated n times, each time leaving a single observation out for validation.

• Monte Carlo cross-validation repeatedly splits the data set at random into a training and a validation set, retaining, for example, 2/3 of the data for training and 1/3 for validation. The procedure is, in fact, a sequence of split-sample validations applied to random permutations of the data. Because of the random splits, the procedure does not guarantee that every point is used both for training and for validation.

• bootstrapping, in contrast to the above methods, resamples with replacement from the original data set, generating new training sets (bootstrap samples) of the same size n. The new training sets may therefore contain duplicated training examples, while other samples are left out. On average, a bootstrap sample contains about 63.2% of the distinct observations of the original set. The procedure is repeated B times. The .632 estimator of the error rate (or of another performance measure) is given by
$$\hat{e}_{.632} = 0.368\, \hat{e}_0 + 0.632\, \frac{1}{B} \sum_{b=1}^{B} \hat{e}_b,$$
where $\hat{e}_0$ is the plug-in error rate on the full training set and $\hat{e}_b$ is the error rate obtained at repetition b by applying the classifier to the left-out data. The empirical distribution of $\hat{e}_b$ can be used for estimating confidence intervals (for example, the 0.025 and 0.975 quantiles of this distribution are reasonable estimates for the lower and upper limits of the 95% confidence interval). A minimal implementation of this estimator is sketched after this list.

All these resampling procedures for performance estimation can be implemented so as to preserve the class proportions of the original data set. In this case they are called stratified since, indeed, the sampling takes place within the strata (levels) of the class label variable.
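As a concrete illustration of the .632 estimator referenced in the list above, here is a minimal sketch (Python with NumPy and scikit-learn, on simulated data; a logistic regression is used only as a stand-in classifier, and all names and values are hypothetical).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def error_rate(model, X, y):
    return np.mean(model.predict(X) != y)

def bootstrap_632(X, y, B=100, seed=0):
    """The .632 estimator: 0.368 * plug-in error + 0.632 * mean out-of-bag error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    err_plugin = error_rate(LogisticRegression(max_iter=1000).fit(X, y), X, y)
    oob_errors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)       # left-out (out-of-bag) samples
        if len(oob) == 0 or len(np.unique(y[idx])) < 2:
            continue                                # skip degenerate bootstrap samples
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        oob_errors.append(error_rate(m, X[oob], y[oob]))
    oob_errors = np.array(oob_errors)
    estimate = 0.368 * err_plugin + 0.632 * oob_errors.mean()
    ci95 = np.quantile(oob_errors, [0.025, 0.975])  # rough CI from the empirical distribution
    return estimate, ci95

# Toy usage on simulated data (hypothetical).
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 20))
y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=80) > 0, 1, -1)
print(bootstrap_632(X, y))
```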
4. Guidelines for gene-based classifier development

Developing gene-based classifiers poses several specific problems, in addition to the "classical" issues that arise when building predictive models. Some of these specific issues are methodological, while others relate to the utility and relevance of the classifiers built.

4.1. Methodological issues

During the last decade, thousands of new gene-based classifiers have been published, covering a large palette of applications. The US Food and Drug Administration, which is responsible for approving new diagnostic tests for medical applications, set up a series of projects to investigate the reproducibility and reliability of decision models built on gene expression data. These projects, gathered under the acronym MAQC (MicroArray Quality Control), have shown that the technology is mature enough to be used in clinical practice. The second phase (MAQC-II) dealt specifically with classification models (Shi et al, 2010) and put forward a number of recommendations, some of which are mentioned below.

As mentioned before, the data points lie in a high dimensional space in which the number of dimensions greatly exceeds the data set cardinality ($n \ll p$). This makes the problem ill-posed, in the sense that, theoretically at least, there may be an infinite number of solutions to a classification problem. This is why building a classifier requires a proper feature (variable) selection step before training the model per se; the methods for performing feature selection are not discussed here. We only mention that feature selection can be done either independently of or jointly with the training of the classifier (it may be embedded into the training process, as in the case of penalized logistic regression, for example), but in any case it is a mandatory step.

In the early phases of the development of a new classifier, one usually tries many different algorithms before narrowing the selection down to a few of them. The initial set of classifiers to be tried should be rich enough that a suitable model can be found. MAQC-II has shown that, in most cases, the simpler methods perform as well as the more sophisticated ones on gene expression data. In this project, more than 30,000 models were assessed, and the conclusion was that the major factor impacting the performance of the models is the difficulty of the problem and not the complexity of the algorithms thrown at it (Shi et al, 2010). Once the initial exploratory phase is completed, the list of candidate models should be short (2-3 models). These candidates should be evaluated on new data and the final model selected. The final model can then be trained and its performance estimated (either by resampling methods or on other independent data). This approach requires a fairly large amount of data, but it is likely to produce a robust model that is not overfitted to the training data.

Performance estimation is the task most prone to methodological errors in building a classifier. In theory, all the steps performed from raw data to the final model must be included in the cross-validation (or other resampling) loop, as illustrated in the sketch below.
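A minimal sketch of this principle, using scikit-learn on simulated data (the gene counts, the selector and the classifier are hypothetical choices, not the MAQC-II recommendations themselves): the feature selection step is placed inside a pipeline so that it is refitted on each training fold and never sees the validation samples.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(5)
# Simulated "n << p" expression matrix: 60 samples, 2000 genes,
# with only the first 5 genes carrying the class signal (hypothetical data).
X = rng.normal(size=(60, 2000))
y = np.where(X[:, :5].sum(axis=1) > 0, 1, 0)

# Feature selection lives INSIDE the pipeline, so it is re-fitted on each
# training fold only and never on the samples used for validation.
clf = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("cross-validated accuracy: %.3f +/- %.3f" % (acc.mean(), acc.std()))
```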
However, this is not always feasible: for example, microarray data normalization usually inspects the whole batch of raw samples in order to produce the normalized data. This means that any input vector for the classifier is influenced by information from the other vectors in the data set, so, strictly speaking, the normalization step would also have to be performed inside the cross-validation loop. On the other hand, the normalization step can be quite computationally demanding, and repeating it at each iteration slows down the model assessment. While this issue is not well studied in the literature, the common consensus is that the performance estimates are only marginally impacted by whether or not the data normalization step is included in the cross-validation; that is why the overwhelming majority of studies leave the normalization outside the cross-validation. Nevertheless, apart from the normalization step, all the other processing steps must be included in the cross-validation (ideally, even the model selection step, if a model is selected at the end). Failing to obey this rule will lead to an optimistically biased estimation of performance (Varma and Simon, 2006). By far, incorrect performance estimation is the most common error in prediction modeling (Dupuy et al, 2007).

Finally, a question that still lacks a definitive answer concerns the sample size needed for developing a gene expression classifier. Sample size estimation can be done under parametric assumptions: for example, (Dobbin and Simon, 2005) assume a multivariate normal distribution of the classes and derive mathematical formulae for computing the sample size for classifier development and validation. Assumption-free approaches also exist and rely on simulations: (Popovici et al, 2010) show how learning curves can be used to estimate whether increasing the sample size would bring any benefit for classifier training and what sample size would be required to achieve a predefined performance.

4.2. Clinical utility and relevance

The final goal of gene-based classifiers is to answer a clinical or biological research need, so they have to compete with the predictive models currently used in the respective fields. Thus, for the development and validation of clinically relevant genomic tests, a number of key stages must be successfully completed (Simon, 2006):
• identify an important therapeutic decision that needs improvement;
• the target patient population should be homogeneous enough and the treatment uniform, so that the results will be therapeutically relevant. The economic considerations should not be overlooked either: the treatment options and the costs of misclassification should be such that the resulting classifier/test is likely to be used in clinical practice (the test itself incurs costs as well);
• develop the classifier and perform an internal validation to assess whether the classifier appears to be sufficiently accurate relative to the standard prognostic factors currently used. This means that the initial analysis should prove the superiority (in performance, or in costs at equal performance) of the new test with respect to current practice;
• translate the classifier to a platform likely to be used in practice. For example, even though a classifier relying on several tens of genes could be developed from microarray data, its implementation on qPCR is likely to be more appealing to clinicians and laboratories;
• demonstrate the reproducibility of the results;
• independently validate the complete test in prospective clinical trials.

5. Examples of gene-based classifiers

A simple search of the PubMed portal for scientific literature (www.pubmed.com) lists hundreds of papers proposing new gene expression classifiers (sometimes called biomarkers), reflecting the importance these tools have gained in biomedical research.
It would be a futile and inherently subjective attempt to list here "the most representative" results in the field. Therefore we limit ourselves to mentioning just three such classifiers and some of their applications, each of them having something particular that makes it stand out of the crowd.

5.1. Golub's ALL vs AML classifier

Golub's classifier (Golub et al, 1999) is one of the first classifiers built in the early days of microarrays. It was designed to distinguish between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), and addressed the need for a standardized test for establishing the diagnosis. The training set consisted of n = 38 cases (27 ALL, 11 AML) profiled on an early Affymetrix chip (p = 6817 genes). The authors identified 50 genes correlated with the class distinction (based on a signal-to-noise ratio measure) and combined these genes into a score by "weighted voting" (i.e. a linear combination of the genes' expression values). They then validated the predictor on an independent collection of 34 samples. By today's standards this is an easy problem; nevertheless, the merit of this first system was to prove that building classifiers on gene expression is not only feasible but can solve important diagnostic problems. The fact that the classifier relied on known oncogenes (like c-MYB and HOXA9) strengthened the confidence in such decision systems.

5.2. Compound covariate predictor

In (Radmacher et al, 2002) a generalization of Golub's classifier was proposed. Again, for a specimen $i$ a score is computed as a weighted sum of the expression values of a number of genes,
$$s_i = \sum_j w_j x_{ij},$$
where the weights $w_j$ are the signed t-statistics measuring the association of gene $j$ with the class to be predicted, the sign indicating whether the gene is positively or negatively associated with the class. The score is then compared with a threshold computed as the average of the mean scores of the two classes. This compound covariate predictor is prototypical for a large number of classifiers based on gene expression. It has the appealing property of being easily understandable, as each gene contributes to the score proportionally to its fold change between the two classes.

5.3. Top scoring pairs

The final example of a gene-based classifier is Geman's Top Scoring Pairs (TSP) classifier (Geman et al, 2004), described in Section 2.4. The striking feature of this classifier is its simplicity: for easier classification problems, comparing the expression levels of only two variables (genes) suffices for taking a decision. However, as this is seldom enough for most problems, extensions of the algorithm have been proposed in which the top pairs are combined by majority vote (Tan et al, 2005) or by weighted combinations (Popovici et al, 2011). Despite its apparent simplicity, the classifier performs remarkably well on a large number of problems. Moreover, as the decision is taken by comparing the relative order of two genes, the classifier is extremely robust to noise and translates well from one platform to another. Recently, this classifier was used to build a predictive model for identifying the colorectal cancer patients harboring a BRAF mutation (Popovici et al, 2012). In this study, the authors used the TSP approach to build a 64-gene classifier (32 pairs) distinguishing the BRAF mutant patients from those that are BRAF and KRAS wild type. The training set consisted of 431 cases (of which 47 were BRAF mutants).
Despite the highly imbalanced setting, the classifier's estimated performance (using repeated 5-fold cross-validation) was very good (sensitivity 95.8%, specificity 86.5%). The proper use of the cross-validation procedure led to an accurate estimation of the performance, as the independent validation showed: on three external data sets, the aggregated performance was a sensitivity of 96.0% and a specificity of 86.24%. The classifier also demonstrated good robustness, as the external validation sets originated from different microarray platforms than the training set. While the original purpose of the classifier was to predict the BRAF mutant patients, when applied to the KRAS mutant population (which was not part of the training set) it segregated this population into two subpopulations with clearly different gene expression patterns (on selected differentially expressed genes) (Figure 6).

Figure 6. Heatmap showing the different expression patterns within the KRAS mutant population, between the BRAF-mutant-like patients (those predicted by the classifier, marked in red in the right column) and the rest of the KRAS mutants.

The classifier also had strong prognostic value, i.e. it identified the high-risk patients. For example, Figure 7 shows the Kaplan-Meier curves for the two populations predicted by the classifier (pred-BRAFm stands for "predicted BRAF mutant", pred-BRAFwt for "predicted BRAF wild type"). This discovery is of clinical relevance, since it identifies a larger population at risk than initially considered in clinical practice. It also opens new interventional avenues which would target specific pathways active only in this "BRAF-mutant-like" population. Finally, it is also important for the design of clinical trials, since it clearly shows that the KRAS mutant population is not homogeneous and extra stratification factors should be taken into account.

Figure 7. Survival after relapse: patients predicted to be "BRAF mutant" form a high-risk group, with a median survival time of about 12 months.

6. Some concluding remarks

In this chapter we tried to briefly present a number of key concepts for understanding classifiers in general, together with the specific issues arising from their application in the context of gene expression data. While the optimal application of classification algorithms requires intimate knowledge of the theory underlying their development, making good use of them requires only a more superficial understanding of the principles of rigorous classifier development. What remains extremely important is to understand the risks resulting from improper validation and performance estimation: such classifiers will never perform as expected.

References

Duda RO, Hart PE, Stork DG. 2001. Pattern classification. 2nd edition. John Wiley and Sons
Dupuy A, Simon R. 2007. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute 99:147-157
Geman D, D'Avignon C, Naiman DQ, Winslow RL. 2004. Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics and Molecular Biology 3(1):19
Golub TR, Slonim DK, Tamayo P, Huard C, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286
Hastie T, Tibshirani R, Friedman J. 2009. The elements of statistical learning. 2nd edition. Springer Verlag
Newcombe RG. 1998. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17:857-872
Pepe MS. 2003. The statistical evaluation of medical tests for classification and prediction. Oxford University Press
Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, Hess KR, Valero V, Booser D, Delorenzi M, Hortobagyi G, Shi L, Symmans WF, Pusztai L. 2010. Effect of training sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Research 12(1):R5
Popovici V, Budinska E, Delorenzi M. 2011. Rgtsp: a generalized top scoring pairs package for class prediction. Bioinformatics 27(12):1729-1730
Popovici V, Budinska E, Tejpar S, Weinrich S, Estrella H, Hodgson G, Van Cutsem E, Xie T, Bosman FT, Roth AD, Delorenzi M. 2012. Identification of a poor-prognosis BRAF-mutant-like population of patients with colon cancer. Journal of Clinical Oncology 30:1288-1295
Radmacher MD, McShane LM, Simon R. 2002. A paradigm for class prediction using gene expression profiles. Journal of Computational Biology 9(3):505-511
Shi L, MAQC Consortium. 2010. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology 28(8)
Simon R. 2006. Roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology 23:7332-7341
Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D. 2005. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21(10):3896-3904
Varma S, Simon R. 2006. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7:91