Searching for robust and clinically relevant subtypes in high-density molecular data

Eva Budinská1,2,3

1 Institute of Biostatistics and Analyses, Masaryk University; e-mail: budinska@iba.muni.cz
2 Swiss Institute of Bioinformatics, Lausanne, Switzerland
3 Masaryk Memorial Cancer Institute, Brno

Abstract

Identification of clinically relevant molecular subtypes has become an important tool in elucidating tumour biology. The robustness of the analytical approach employed for this task, and consequently the robustness of the derived subtypes, is of major concern if further experiments are to be conducted to confirm the derived hypotheses. Here, we discuss several novel techniques for the control of robustness in cluster analysis designed for the analysis of high-density molecular data.

Key words

Gene expression, molecular subtyping, consensus clustering, dynamic tree cut

1. Introduction

Subtyping based on high-density molecular data, such as microarrays, aims at identifying groups of samples with similar molecular patterns. These can be, for instance, similar patterns of gene expression, microRNA expression or methylation. Finding molecular subtypes is particularly relevant in medicine, especially in diseases that appear homogeneous histopathologically, yet give a very heterogeneous response in terms of treatment outcome or survival. Recently, much effort has been dedicated to the molecular subtyping of different cancers. Breast cancer, for instance, was the first one where gene expression subtyping was applied, revealing a set of groups that to this day serve as a basis for treatment considerations (Perou et al., 2000). Molecular subtypes help to elucidate the underlying biological mechanisms responsible for heterogeneity in tumour behaviour and help to focus research on subtype-specific drug targets, with the hope of optimizing treatment and ensuring a better prognosis of the given cancer as a whole.

In order to make molecular subtypes clinically relevant, many additional analyses elucidating the biological, clinical and prognostic inference of the subtypes are needed. The analysis becomes a fairly complex process involving different data-mining and statistical tools, together with a thorough bio-medical interpretation of the results. Here, we will focus on the most important part of the subtyping procedure – robust clustering.

2. Example dataset

Throughout this article, we will use two datasets:

golub dataset – a microarray-derived gene expression dataset available in the R package multtest under the name golub, comprising 38 samples of three groups of acute leukaemia (AML – acute myeloid leukaemia, ALL-B – acute lymphoid leukaemia of B-cell type and ALL-T – acute lymphoid leukaemia of T-cell type) and gene expression values of 3051 genes. This was the first dataset used to demonstrate the use of gene expression data in cancer studies (Golub et al., 1999).

random dataset – a matrix of 1000 features and 100 samples, randomly sampled from the normal distribution with mean 0 and standard deviation 1. This dataset will serve as an example of a dataset without a particular inner structure.
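Both datasets can be prepared in R along the following lines (a minimal sketch; the seed is an arbitrary choice, used only to make the random dataset reproducible):

library(multtest)     # Bioconductor package containing the golub data
data(golub)           # loads the 3051 x 38 expression matrix golub
                      # and the associated class labels golub.cl
set.seed(123)         # arbitrary seed, for reproducibility of the example only
random <- matrix(rnorm(1000 * 100), nrow = 1000, ncol = 100)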
3. Robust clustering

Clustering – so-called unsupervised learning – is the analytical approach used for subtype derivation. The main objective of clustering is to find distinct, preferably non-overlapping subpopulations within a large population of interest, whose members share a similar pattern. Different basic clustering techniques exist and can be divided into model-based and distance-based methods. The model-based methods use parametric assumptions on the data distribution and often provide probabilities of cluster assignment. The distance-based methods are based on a similarity measure and can be further split into hierarchical and non-hierarchical, according to the algorithm they apply in order to group the samples. A detailed description of these methods and a discussion of the choice of metrics is beyond the scope of this article and can be found elsewhere (Budinska and Bortlicek, 2009).

In large genomic studies, hierarchical clustering is a particularly preferred method, because of its pattern visualization advantage. Often, not only clusters of samples, but also clusters of features – molecular entities – that underpin biological differences are of importance. The heatmap – a coloured two-dimensional plot with rows and columns representing samples and genes, ordered according to the hierarchical clustering dendrogram – is one of the most often published types of figure in the field of large-scale molecular data (with the exception of DNA sequencing).
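For illustration, the basic hierarchical clustering and heatmap workflow can be sketched in base R as follows (a minimal sketch; the correlation-based distance and average linkage match the settings used for Figure 3 below, and the restriction to the first 100 genes is only to keep the example light):

d  <- as.dist(1 - cor(golub))               # correlation-based distance between samples
hc <- hclust(d, method = "average")         # average linkage hierarchical clustering
plot(hc, labels = golub.cl)                 # dendrogram, labelled by the known classes
heatmap(golub[1:100, ],                     # first 100 genes only, for speed
        distfun   = function(x) as.dist(1 - cor(t(x))),
        hclustfun = function(x) hclust(x, method = "average"))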
It is well known that the choice of clustering algorithm and metrics affects the final results, because clustering algorithms are biased towards partitions in accordance with their own clustering criterion. Moreover, clustering algorithms are designed to provide a data partition even when no such pattern exists, and the significance of the results must be assessed ad hoc. While the clustering algorithm and the corresponding metrics can be selected a priori, based on the data type, our experience or published recommendations, two main issues still have to be addressed: i) the determination of the number of clusters and ii) the assessment of the confidence of the selection of the number of clusters and of the cluster assignment of individual samples. Lacking an external measure of class assignment (ground truth), the evaluation of clusters is based solely on internal validation measures, estimating the quality from the intrinsic data values.

These issues are of particular importance in the analysis of high-density molecular data, which suffer from the curse of dimensionality. The small number of samples (tens to hundreds) and the relatively huge number of molecular features (thousands, tens of thousands) make clustering techniques susceptible to over-fitting, due to their sensitivity to noise, which is much more abundant in these data. This strongly affects the robustness of the clustering to sampling variability.

Resampling of the original dataset is a way to simulate sampling variability. Although the idea of resampling in clustering is not new (Jain and Moreau, 1988), in the case of the noisier high-density molecular data the preference is to avoid sampling with replacement, because replicated values can be artificially considered a separate cluster (Monti et al., 2003). Multiple methods have recently been suggested to address these problems in the context of microarray data analysis, mainly based on repeated resampling and subsequent re-clustering of the original dataset, in order to study the behaviour of the results when the data are perturbed. This approach simulates possible differences between datasets, presumably resulting in a more robust result (for a review, see e.g. Handl et al., 2005).

For example, the prediction-based resampling method Clest (Dudoit and Fridlyand, 2002) was designed to robustly estimate the number of clusters, and its performance in microarray data was shown to be superior to six other methods, including more conventional ones such as Silhouette (Kaufman and Rousseeuw, 1990) and more recent ones such as the gap statistic (Tibshirani et al., 2001). However, this method does not solve the problem of assessing the confidence of the cluster assignment of individual samples. A newer method addressing both problems – consensus clustering (Monti et al., 2003) – was suggested and has been successfully applied in different cancer subtyping analyses. In a comparative study (Giancarlo et al., 2008), this method was also evaluated as the best in terms of performance and algorithm independence. We dedicate the following subsection to the description of this method.

3.1. Consensus clustering

Consensus clustering is a resampling and re-clustering based method designed to represent the consensus across multiple runs of a clustering algorithm (Monti et al., 2003), in order to:

· determine the number of clusters in the data and assess the stability of the discovered clusters,
· represent the consensus over multiple runs of a clustering algorithm with random restart, so as to account for its sensitivity to the initial conditions.

In addition, it serves as a visualization tool for the evaluation of cluster number, membership and boundaries.

The basic principle is to perturb the structure of the original N × P data matrix by random selection of a subset of samples and/or features. The new dataset is then clustered, given the selected clustering algorithm, similarity measure and number of clusters or tree cut height. This resampling and clustering is repeated L times. In the l-th run, the cluster membership of the samples is recorded and two N × N matrices are created:

* a connectivity matrix $M^{(l)}$ that stores, for each pair of samples i, j, the information whether they were clustered together: $M^{(l)}(i,j) = 1$ if samples i and j belong to the same cluster, 0 otherwise;
* an indicator matrix $I^{(l)}$ that stores, for each pair i, j, the information whether they were both selected in the resampling: $I^{(l)}(i,j) = 1$ if samples i and j were both present in the l-th resampled dataset, 0 otherwise.

After all L runs, the consensus matrix is calculated by dividing the number of times two samples were found in the same cluster by the number of times they were selected together in the resampled subsets. The consensus matrix is therefore an N × N matrix that stores, for each pair of samples, the weighted proportion of clustering runs in which the two samples were clustered together:

$$M(i,j) = \frac{\sum_{l} M^{(l)}(i,j)}{\sum_{l} I^{(l)}(i,j)}.$$

The idea behind this approach is that samples that are frequently found in the same cluster represent more reliable cluster members than those that cluster together less frequently, the latter being more sensitive to random noise and to changes in feature selection. Each entry of the consensus matrix is the consensus index of a given pair of samples, with values ranging from 0 (no consensus, the samples were never members of the same cluster) to 1 (perfect consensus, the samples were members of the same cluster in 100% of the runs in which they were selected together). The consensus matrix M represents a robust similarity measure, and 1 − M is a distance matrix that can be used as an input to hierarchical clustering in order to obtain a final robust clustering.
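The procedure is straightforward to sketch in R. The following is a naive illustration of the consensus matrix computation, not the reference implementation of Monti et al.; the subsampling proportion of 0.8 and the use of hierarchical clustering with average linkage and correlation-based distance are assumptions made for the example:

consensus_matrix <- function(x, k = 3, L = 500, p = 0.8) {
  # x: features x samples matrix; k: number of clusters;
  # L: number of resampling runs; p: proportion of samples drawn per run
  N    <- ncol(x)
  conn <- matrix(0, N, N)                   # times each pair clustered together
  ind  <- matrix(0, N, N)                   # times each pair sampled together
  for (l in seq_len(L)) {
    sel <- sample(N, size = floor(p * N))   # subsample without replacement
    d   <- as.dist(1 - cor(x[, sel]))       # correlation-based distance
    cl  <- cutree(hclust(d, method = "average"), k = k)
    ind[sel, sel]  <- ind[sel, sel] + 1
    conn[sel, sel] <- conn[sel, sel] + outer(cl, cl, "==")
  }
  M <- conn / pmax(ind, 1)                  # consensus matrix; avoids division by zero
  diag(M) <- 1
  M
}

M  <- consensus_matrix(golub, k = 3)
hc <- hclust(as.dist(1 - M), method = "average")   # final robust clustering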
Figure 1 demonstrates the result of hierarchical clustering applied to the consensus matrix of our example data. While the consensus matrix of the random dataset is very unstructured, a very clear three-class structure is visible for the golub dataset.

Figure 1. Heatmap representation of the consensus matrix for the random dataset (left) and the golub dataset (right), for three clusters. The colour ranges from white, representing a consensus of 0, to bright blue, representing a consensus of 1.

The consensus matrix between samples can be directly used to define statistics of the stability of clusters and of cluster sample assignments. Let $I_k$ be the set of indices of samples belonging to cluster k, and $N_k$ its size. The consensus measure of cluster k – the cluster consensus – can be defined as the average consensus index between all pairs of samples belonging to the cluster:

$$m(k) = \frac{1}{N_k (N_k - 1)/2} \sum_{i,j \in I_k,\; i<j} M(i,j).$$

The corresponding sample consensus of sample i with respect to cluster k can be defined as

$$m_i(k) = \frac{1}{N_k - \mathbf{1}\{i \in I_k\}} \sum_{j \in I_k,\; j \neq i} M(i,j),$$

where $\mathbf{1}\{\cdot\}$ is the indicator function that equals 1 if its argument is true, 0 otherwise. The sample consensus is thus the average consensus index of the sample with all members of the cluster. Both measures can be used to identify outliers – either clusters with relatively low consensus, suggesting remaining heterogeneity within the cluster, or samples that could be considered outliers because of a very small consensus with any other sample in the dataset. The consensus matrix can also be used to estimate the optimal number of clusters; for details, see section 3.2.1.

3.1.1. Other consensus clustering techniques

Multiple variations of the consensus clustering method exist and are natural extensions of the original algorithm. The method called merged consensus clustering (Simpson et al., 2010), in contrast to the method of Monti et al., creates the consensus matrix as a function of runs of consensus clustering with multiple different algorithms. This should eliminate the possible negative effect of a single algorithm that might not be suitable for the particular type of data. Weighted consensus clustering (Deodhar and Ghosh, 2006) builds on the idea that the clusterings produced within a consensus clustering procedure are not necessarily of the same quality. If an external metric of quality exists, one should be able to integrate it in order to weigh the contribution of each clustering to the final consensus matrix, which is then calculated as

$$M(i,j) = \frac{\sum_{k} w_k M^{(k)}(i,j)}{\sum_{k} w_k I^{(k)}(i,j)},$$

where $w_k$ is the weight of the k-th clustering. This method also uses different clustering algorithms and different distance measures. For a comparison of different consensus clustering algorithms, see for example (Goder and Filkov, 2008).

3.1.2. R-packages for consensus clustering

Two major packages are available in R for consensus clustering. The package ConsensusClusterPlus (Wilkerson, 2011) provides all the algorithms and metrics described in (Monti et al., 2003). It is part of the Bioconductor repository and can be installed directly from the R console using the commands:

>source("http://bioconductor.org/biocLite.R")
>biocLite("ConsensusClusterPlus")

The package clusterCons implements the merged consensus clustering of (Simpson et al., 2010). It can be installed directly from the R console using the install.packages() command.
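A typical ConsensusClusterPlus call for the golub data might look as follows (a sketch only; the parameter values – the maximal k, the number of resamplings and the subsampling proportion – are arbitrary choices for the example):

library(ConsensusClusterPlus)
res <- ConsensusClusterPlus(golub,
                            maxK = 6,             # evaluate k = 2, ..., 6
                            reps = 1000,          # number of resampling runs
                            pItem = 0.8,          # proportion of samples per run
                            pFeature = 1,         # proportion of features per run
                            clusterAlg = "hc",    # hierarchical clustering
                            distance = "pearson", # correlation-based distance
                            seed = 123)
icl <- calcICL(res)        # cluster- and sample-level consensus statistics
icl$clusterConsensus       # cluster consensus m(k) for each k and cluster
head(icl$itemConsensus)    # sample (item) consensus m_i(k)

Besides the consensus matrices, the package also produces the consensus heatmaps and the CDF and delta area plots discussed in the next section.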
3.2. Determining the number of subtypes

In this section, two methods for determining the number of clusters are discussed. Both were developed specially for microarray data analysis and the hierarchical clustering algorithm.

3.2.1. Consensus measure

The consensus matrix, as described in section 3.1, can also be used to estimate the optimal number of clusters. If consensus clustering is run for different cluster numbers k = 2, ..., K, the decision criterion can be based, for example, on the average intra-cluster consensus for each k. Monti et al. (2003) propose another measure, the empirical cumulative distribution function (CDF) of the consensus matrix entries,

$$CDF(c) = \frac{\sum_{i<j} \mathbf{1}\{M(i,j) \le c\}}{N(N-1)/2},$$

which compares the distributions of the entries of the consensus matrix M for each k. If a clustering with k clusters represents a perfect partition, the histogram of the consensus matrix entries will consist of two bins, over 0 (no consensus at all between samples from different clusters) and 1 (perfect consensus between samples from the same cluster). The optimal number of clusters can then be decided by computing the area under the CDF curve and examining its relative change between consecutive values of k (the delta area). The CDF measure, however, is applicable mainly to hierarchical clustering, for which the method was designed.

Figure 2 shows examples of histograms for k = 3 and k = 6 and the CDF and delta area for the golub data. While the histogram of consensus measures for the three-cluster structure (heatmap in Figure 1, right) indeed reveals a majority of values at 0 or 1, the six-cluster structure has a substantially decreased number of perfect-consensus pairs and an increased number of values between 0 and 1, suggesting instability of this number of clusters. The delta area plot shows that increasing the number of clusters from 2 to 3 gains around 0.36 in the area under the CDF, while further increasing the number of clusters up to 5 has no real impact on the change of the area under the CDF; therefore, the estimated value of k would be 3 or 4 subtypes.

Figure 2. Example of CDF derivation and selection of the number of clusters on the golub dataset. The consensus CDF and delta area plot are shown for k varying from 2 to 10.

3.2.2. Dynamic Tree Cut

As already mentioned, hierarchical clustering has a particular importance in genomic data analysis. In comparison to other clustering techniques, clusters are defined ad hoc, by cutting the branches of the hierarchically structured similarity tree – the dendrogram, the output of this clustering – at a fixed height. All the branches below this cut are preserved and represent the final clusters. The major disadvantage of this static cut approach is that, often, different clusters are present at different cut heights – naturally presenting more or less similar groups of samples – and cutting low in order to obtain a cluster with high internal similarity results in the loss of the structure of clusters with relatively lower similarity.

To address this problem, a set of novel dynamic branch cutting methods for detecting clusters in a dendrogram of hierarchical clustering was recently proposed (Langfelder et al., 2007). In this approach, clusters are defined depending on their shape. The huge advantage is that the system of cluster determination is flexible – a set of parameters can be used to control the resulting cut, such as the cut height, the minimal cluster size or the maximal intra-cluster scatter. The first method, DynamicTree, is a flexible extension of the static cut that works solely with the structure of the dendrogram. The second method, DynamicHybrid, dynamically crawls the dendrogram in the bottom-up direction and, after defining core clusters, offers the possibility of additionally assigning the unassigned samples to the closest core clusters defined in the first step, if the requirements on cluster internal similarity are met. The description of both algorithms is fairly complex and we strongly recommend that the reader consult the original paper for more details. The Dynamic Tree Cut methods are implemented in the R package dynamicTreeCut.
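The settings used for Figure 3 below can be reproduced along the following lines (a sketch; the correlation-based distance and average linkage follow the figure caption, while the rest illustrates the dynamicTreeCut interface):

library(dynamicTreeCut)
distM <- as.matrix(as.dist(1 - cor(golub)))   # correlation-based distance, as a matrix
hc    <- hclust(as.dist(distM), method = "average")

static <- cutree(hc, h = 1.2)                 # static cut at height 1.2
dyn5 <- cutreeHybrid(hc, distM = distM, cutHeight = 1.2, minClusterSize = 5)
dyn3 <- cutreeHybrid(hc, distM = distM, cutHeight = 1.2, minClusterSize = 3)
table(static, dyn5$labels)                    # 0 in $labels marks unassigned samples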
An example of the comparison of a static and a dynamic cut of the dendrogram is demonstrated on the golub dataset in Figure 3. Both the static cut and the DynamicHybrid algorithm (represented by the function cutreeHybrid) were run with a cut height of 1.2. The minimal cluster size selected for cutreeHybrid was 3 and 5. While the static cut at the selected height identifies 3 clusters, cutreeHybrid with a minimal cluster size of 5 identifies four major clusters. Decreasing the minimal cluster size to 3 identifies further, yet still consistent, splits.

Figure 3. Comparison of the static and the DynamicHybrid cut (as output by the cutreeHybrid function of the R package dynamicTreeCut) on the dendrogram from hierarchical clustering with the average linkage algorithm and a correlation-based distance between samples of the golub data.

4. Other analytical challenges

Robustness of findings is one of the most important aspects of applied research and is indispensable for clinical relevance. In order to call the subtypes robust, it is vital that the patterns defining the groups we find are not specific to a particular dataset, but can be found in other similar populations. We say that the subtypes must be validated. However, validation of a de-novo developed subtyping system has a somewhat different meaning and is a much less evident analytical task than in the construction of classifiers. This is because it is not straightforward to validate a pattern without an existing objective class label (the ground truth). Without the ground truth, the validation can be done only indirectly, by the assessment of subtype-specific differences in population characteristics that were not used for their construction. Different survival experience or clinically relevant variables are examples of such characteristics.

Often, the development of a subtype classifier is necessary in order to make the results applicable in practice. Preferably, such a classifier will be accurate, robust to the different technological platforms used to derive the data, and able to classify a single sample at a time. This classifier can also serve to call the subtypes in the validation set. However, in a complex subtyping analysis, many decisions on the types of methods and the choice of parameters must be made. Although some can be subjected to a sensitivity analysis exploring the effect of different choices on the clustering results (such as the similarity metrics or the clustering algorithms), it is almost impossible to perform such an analysis for all considered parameters and algorithmic choices, due to the complexity of the problem. For this reason, a simple application of the classifier to the validation set and a consequent comparison of the external characteristics of the training vs. validation subtypes is not the optimal solution. A better solution is to reproduce the subtyping on the validation set, using the same methods and parameters as selected for the training set.

5. Concluding remarks

We have presented a selection of state-of-the-art approaches for robust clustering in molecular subtyping. The field is evolving very quickly, however, and the reader is strongly encouraged to search for methodological improvements and to critically review all the information provided with respect to the nature of the particular data analysed. Some concepts remain, though, the same.
The robustness and reproducibility of clinical research are indispensable. One should never search for the final and unchangeable answer – which is almost impossible to achieve because of the nature of biology and of technological limitations – but rather focus on the extraction of the most essential information from the data that are available. In this respect, the application of consensus clustering based methods seems inevitable, although in the case of hierarchical clustering one might consider using Dynamic Tree Cut for cluster assignment, as it allows for the identification of core samples without forcing the less representative samples to be assigned a cluster membership.

References

Budinska E, Bortlicek Z. 2009. E-learning in analysis of genomic and proteomic data. URL: http://telemedicina.med.muni.cz/genomic-proteomic-analysis/index-en.php

Deodhar M, Ghosh J. 2006. Weighted consensus clustering for microarray data analysis. In: Dagli CG et al. (eds): Intelligent Engineering Systems through Artificial Neural Networks, Volume 16.

Dudoit S, Fridlyand J. 2002. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7).

Giancarlo R, Scaturro D, Utro F. 2008. Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 9:462.

Goder A, Filkov V. 2008. Consensus clustering algorithms: comparison and refinement. In: Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), San Francisco, 19 January 2008. Society for Industrial and Applied Mathematics.

Golub TR, Slonim DK, Tamayo P, Huard C, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.

Handl J, Knowles J, Kell DB. 2005. Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15):3201-3212.

Jain AK, Moreau J. 1988. Bootstrap techniques in cluster analysis. Pattern Recognition 20:547-568.

Kaufman L, Rousseeuw PJ. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.

Langfelder P, Zhang B, Horvath S. 2007. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 24(5):719-720.

Monti S, Tamayo P, Mesirov J, Golub T. 2003. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52(1-2):91-118.

Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lønning PE, Børresen-Dale AL, Brown PO, Botstein D. 2000. Molecular portraits of human breast tumours. Nature 406:747-752.

R Core Team. 2012. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Simpson TI, Armstrong JD, Jarman AP. 2010. Merged consensus clustering to assess and improve class discovery with microarray data. BMC Bioinformatics 11:590.

Tibshirani R, Walther G, Hastie T. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63:411-423.

Wilkerson M. 2011. ConsensusClusterPlus. R package version 1.8.0.