A Novel Method for Assigning Functional Linkages to Proteins Using Enhanced Phylogenetic Trees Hung Xuan Ta 1∗, Patrik Koskinen 2 and Liisa Holm 1,2 1 Institute of Biotechnology, PO Box 56, 00014 University of Helsinki, Finland 2 Division of Genetics, Faculty of Biological Sciences, PO Box 56, 00014 University of Helsinki, Finland ABSTRACT Motivation: Functional linkages implicate pairwise relationships between proteins that work together to implement biological tasks. During evolution, functionally linked proteins are likely to be preserved or eliminated across a range of genomes in a correlated fashion. Based on this hypothesis, phylogenetic profiling based approaches try to detect pairs of protein families that show similar evolutionary patterns. Traditionally, the evolutionary pattern of a protein is encoded by either a binary profile of presence and absence of this protein across species or an occurrence profile that indicates the distribution of copies of this protein across species. Results: In our study, we characterize each protein by its enhanced phylogenetic tree, a novel graphical model of the evolution of a protein family with explicitly marked by speciation and duplication events. By topological comparison between enhanced phylogenetic trees, we are able to detect the functionally associated protein pairs. Because the enhanced phylogenetic trees contain more evolutionary information of proteins, our method shows greater performance and discovers functional linkages among proteins more reliably compared to the conventional approaches. Contact: xuanhung.ta@helsinki.fi, liisa.holm@helsinki.fi Sumplementary information: Sumplementary data are available at Bioinformatics online 1 INTRODUCTION A great number of cellular behaviors are mediated by proteins which always carry out their functions by interacting with each other (Eisenberg et al., 2000). Proteins that participate in a common structural complex or pathway or biological process are defined as functionally linked. Defining functional linkages of proteins is one of the central goals in proteomics that will decipher the molecular mechanisms underlying the biological functions and, then, help to enhance approaches for drug discovery. The complete sequencing of multiple genomes from diverse species provides an excellent opportunity to develop comparative approaches for functional studies in proteomics. Biological interactions and functional linkages of proteins can be inferred via various patterns across many genomes. These include the co-localization of genes on chromosomes (Dandekar et al., 1998; Overbeek et al., 1999), the ∗to whom correspondence should be addressed genetical fusion of two distinct proteins from one organism into a single protein in another organism (Enright et al., 1999; Marcotte et al., 1999), protein domains composition (Sprinzak and Margalit, 2001; Ta and Holm, 2009) and phylogenetic profiles (Pellegrini et al., 1999; Marcotte et al., 2000; Wu et al., 2003; Dutkowski and Tiuryn, 2009). Phylogeny-based methods for protein functional relationship prediction are broadly applicable due to a sufficiently large number of completely sequenced genomes. These methods are premised on the hypothesis that functionally linked proteins evolve in a correlated fashion, and therefore they have homologs in the same set of organisms (Pellegrini et al., 1999). The original approach uses the simplest form of phylogenetic profile of a protein, a binary string in which each bit indicates the presence or absence of a protein family in a different species (Pellegrini et al., 1999). This basic method, often referred to “phylogenetic binary profiling”, subsequently searches for matching pairs of profiles which differ from each other by less than three bits. In further steps, profiles can be compared by statistical correlation measures, such as co-occurrence probability (Wu et al., 2003), the Pearson correlation coefficient (Glazko and Mushegian, 2004), Fisher’s exact test (Barker and Pagel, 2005) or mutual information (Huynen et al., 2000; Date and Marcotte, 2003). Recently, Ranea et al. (2007) introduced phylogenetic occurrence profiling to detect the functionally related protein families in eukaryotic genomes. The phylogenetic occurrence profile of a protein family is a vector in which each element indicates the number of protein members of this family observed in one organism. Unlike prokaryotic genomes, where a high proportion of protein families have about one copy per species, eukaryotic genomes show a large number of multi-protein families having more than one member per species (Ranea et al., 2007). Phylogenetic occurrence profiling, therefore, is able to detect more evolutionary signals that could not be detected by phylogenetic binary profiling. However, binary and occurrence profiles cannot adequately describe the whole evolutionary history of proteins. Many methods characterize proteins by their reconstructed phylogenetic trees. These approaches, then, compare the phylogenetic trees by utilizing the parsimony principle (Barker et al., 2007), maximum likelihood model (Barker and Pagel, 2005), a tree-kernel model (Vert, 2002) or the correlation between the distances matrices used to build the trees (Pazos and Valencia, 2001). 1 Associate Editor: Prof. Burkhard Rost © The Author (2010). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com Bioinformatics Advance Access published December 17, 2010 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom Evolutionary divergence, convergence and horizontal transfer events in multi-protein families are more complex than can be described by conventional phylogenetic profiles. In this work, we introduce a novel graphical model of the evolution of proteins which enchances the information content of phylogenetic profiles by accounting for the reconstructed proteomes of ancestral species and synchronous gene duplication events. Whereas orthologs evolved by vertical descent (speciation) from a single protein in the last common ancestor of the compared genomes, paralogs evolved by gene duplication. In this work, one-to-one correspondences between orthologs are detected using the reciprocal best hits criterion, and duplications are identified using the InParanoid criterion (O’Brien et al., 2005). This so-called enhanced phylogenetic tree (EPT) explicitly traces all the descendants of proteins from the last universal common ancestor (LUCA) down to the extant species. This tree is different from the tree reconciliation (Zmasek and Eddy, 2002) in that protein data is directly fitter on the species tree, and speciation or duplication events are inferred using an exclusion principle to ensure strict one-to-one correspondences between proteins derived by speciation. The EPTs, subsequently, are topologically compared to find the evolutionarily correlated proteins families. Two proteins of a target organism belonging to two correlated families are likely to be functionally linked. In this paper, we benchmark the EPT method by using positive and negative reference sets of functional linkages in Homo sapiens and Saccharomyces cerevisiae. The biological features of the predicted functional linkages are examined by studying GO annotations in the biological process, cellular component, and molecular function branches of Gene Ontology. Our novel method shows a significant improvement compared to conventional phylogenetic profiling methods and makes predictions that are complementary to other prediction methods. 2 METHODS 2.1 Benchmark Datasets Benchmark sets including positive and negative reference sets (PRS and NRS, respectively) are needed to calculate the Receiver Operating Characteristic (ROC), which represents the performance of a classification method. By true positive (TP), we mean an interaction or linkage that is predicted by our method and present in the PRS. Analogously, a true negative (TN) is a pair of proteins predicted by our method not to interact that is present in the NRS. A false positive (FP) is a protein pair in the NRS that is predicted to interact by our method while a false negative (FN) is a protein pair in the PRS predicted not to interact by our method. In the plot of the ROC curve, the x-axis represents false positive rate (FPR) or 1-specificity, that is FP/(TN+FP), and the y-axis represents true positive rate (TPR) or sensitivity, TP/(TP+FN), as the threshold gradually varies. To evaluate the performance of our methods at predicting protein functional linkages, we choose the pairs of proteins existing within the same complex as referent positives and the random pairs of proteins from different complexes, which are not connected in the interaction graph, as referent negatives. For yeast datasets, 5888 pairs of co-complex proteins and 5,888 randomly chosen pairs were derived from the manually curated catalogue in MIPS (Mewes et al., 2004). The benchmark datasets of human functional linkages are derived from the CORUM database, a resource of manually annotated protein complexes from mammalian organisms (Ruepp et al., 2010). After filtering out cocomplex protein pairs discovered by high-throughput experimental methods, we obtained 26,813 pairs of protein for the human PRS. The human NRS contains 26,813 pairs of proteins chosen randomly from different complexes. 2.2 Protein Families Phylogenetic profiling requires that proteins from different organisms are classified into families which can be compared across species. We clustered the proteins in the proteomes of 572 complete genomes (560 prokaryotic and 12 eukaryotic organisms) based on their BLAST (Basic Local Alignment Search Tool) similarity. This study considered similarities above a bitscore of 50, which effectively filters false positive hits from the sequence comparison (Remm et al., 2001). We tested both lower and higher thresholds, which yielded worse performance in the benchmark. The clusters have an internal hierarchical structure that guarantees strict one-to-one correspondences between proteins from different species in each subfamily. The clustering algorithm is a hierarchical modification of the InParanoid algorithm (Remm et al., 2001). The inputs are the species tree and the protein sequences of a set of organism. The algorithm outputs proteins classified into families with gene duplication events mapped to taxa in the species tree. The details of the clustering algorithm are shown in Figure 1. There were 2,188,588 proteins in the proteomes. 254,418 singletons (about 11,6%) had no similarity to other proteins. The other proteins were clustered into 91,428 protein families (EPTs) having more than one member. The average number of leaf proteins in an EPT was 21.2 approximately. The distribution of EPT size is scale free, i.e. it follows a power law. 2.3 Enhanced Phylogenetic Trees The strength of functional linkage between two protein families is assessed by comparing the topologies of the corresponding EPTs. Figure 2 shows two examples of EPT, tree A and tree B. In tree A, four round-dot lines represent the proteomes of four organisms. Tree A (B) has four (two) subtrees f1, f2, f3, f4 (f1, f2), forming in three (two) layers. Each subtree, representing a subfamily, is composed of the same color leaves, internal node and solid edges (speciation). They are separated by dashed edges (duplication). The corresponding occurrence and binary profiles of the family are shown below. Intuitively, the EPT contains more evolutionary signals than conventional profiling. The NCBI taxonomy tree is used in constructing EPTs. In a subtree, the dashed edges (empty nodes) indicate the edges (nodes) of the NCBI taxonomy tree that do not exist in the subtree. In tree A, the first layer contains the subtree f1, the second layer contains f2 and f3, and the third layer contains f4. 2.4 Subtree similarity The topological similarity of two subtrees is computed guided by the NCBI taxonomy tree, meaning that the comparison is based on the taxonomy identifiers of nodes in two subtrees (see Figure 3). Let TA i be a part of subtree A that roots from node i and C(i, A) denote the set of children of node i in subtree A. The similarity of two subtrees A and B rooting from node i, which is denoted by S(TA i , TB i ) , can be calculated as follows S(TA i , TB i ) = α + k∈C(i,A)∩C(i,B) S(TA k , TB k ) − β |(C(i, A) − C(i, B)) ∩ (C(i, B) − C(i, A))| (1) where X − Y denotes the set of objects that belong to X and not to Y , X ∩ Y denotes the set of objects that belong to both set X and set Y and |X| means the number of elements in set X. In Equation (1), α(β) is the reward (penalty) coefficient. From Equation (1) we have S(TA i , TB j ) = −β if j = i or one of two trees is null. In words, each common node yields a reward α and each deletion of a subtree costs a penalty β. In the current implementation, we require that two subtrees have the same root node identifiers, that is, only subtrees that were created by gene duplications at the same taxonomic node contribute to the EPT score (the tree in the first layer is always rooted at LUCA). 2 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom Fig. 1. Schematic example to show how the family definition algorithm works. Phase I: The proteomes of ancestral species are reconstructed hierarchically based on the species tree. Here, species ABC is reconstructed from child species A, B and C. Proteins are clustered based on all versus all Blast hits between the child species. The list of Blast hits is sorted by similarity score in decreasing order. Initially, each child protein has an image in the ancestor but the actual parent is yet unassigned. Child proteins which derive from the same ancestral protein are identified based on Blast hits with the highest similarity score. Only one protein from each child species may be assigned to the same ancestral protein. Recent duplications (generating in-paralogs within a child species) have higher similarity scores than BLAST hits to a sister species. Singleton proteins have no similarity to sister species; they are assumed to be present in the ancestor. Finally, the algorithm removes image nodes from the ancestor which are without progeny. Phase II: Here, the root node (LUCA) is reconstructed from ABC and D. The similarity scores from reconstructed proteins are estimated as the maximum of the similarities of the cluster members; for example, ABC2-D1 is the maximum of A1-D1, B1-D1 and C1-D1. The final assignments after the reconstruction are shown. Following the edges from the leaf proteins to reconstructed LUCA proteins gives the protein families and lineages shown in Outputs. 2.5 Measuring the Similarity of Enhanced Phylogenetic Trees Two EPTs are compared layer-by-layer (see an example in Figure 2). Let TA ijk be subtree k in the ith layer of EPT A whose root taxonomy identifier is j. Let tA i be the set of the root taxonomy identifiers of the subtrees in the ith layer of EPT A. The number of layers of EPT A is denoted by A. Fig. 2. Two examples of EPTs and their decompositions. Trees A and B are compared layer-by-layer. The missing layer in tree B is penalized (correspoding to the third term of Equation (2)). Subtrees (f1, f1) in the 1st layer and (f2, f2) in the 2nd layer are compared (correspoding to the first term of Equation (2)). The missing subtree in the second layer of tree B is penalized (correspoding to the second term of Equation (2)). The similarity of two EPTs A and B, with an assumption that A > B, is calculated as follow SAB = B i=1 wi    j∈tA i ∩tB i    1 nA ijnB ij nA ij k=1 nB ij l=1 S(TA ijk, TB ijl)    −β    j∈(tA i −tB i ) nA ij + j∈(tB i −tA i ) nB ij       + A i= B+1 wi(−βnA i ) (2) where S(T1, T2) notifies the similarity between two subtrees T1, T2 which is computed by Equation (1). In Equation (2, nA ij = {TA ijk}∀k denotes the number of subtrees whose root taxonomy identifier is j in the ith layer of EPT A and nA i = j∈tA i nA ij is the number of all the subtrees in the ith layer of EPT A. Therefore, the first term compares subtrees having the same root taxonomy identifiers, the second term is the penalty for subtrees in one layer of an EPT and not in the corresponding layer of the other EPT and the third term is the penalty for missing layers. In this study, we set wi = 1/i, meaning that the layers are assigned weights depending on their lineage. Finally the scores are normalized as follows SAB = SAB − min{SAB}A,B max{SAB}A,B − min{SAB}A,B . (3) 2.6 Predicting Functional Linkages of Proteins A protein is characterized by the EPT of the family to which it belongs. Two similar EPTs mean correlated patterns of evolution and, by implication, a functional linkage. We construct EPTs for all the families that contain more than one member and at least one of the members is a protein of the target organism. Two proteins whose EPTs have a higher similarity score than a given threshold are classified as functionally linked or physically interacting proteins. 3 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom Fig. 3. The topological comparison of two trees is based on the taxonomy identifiers of the nodes in two trees. The numbers labeling the nodes in the trees are taxonomy identifiers. The similarity is recursively calculated by Equation 1. In this example, there are five common nodes (black) and three subtree deletions (dashed subtrees). 2.7 Calculating GO Semantic Similarity To investigate the biological features of predicted functionally linked protein pairs, we quantify the functional similarity for these pairs by computing the semantic similarity (SS) score (Wang et al., 2007) of the GO terms with which these proteins are annotated. The semantic score takes into account both the level of hierarchy of GO terms and also their relations with their ancestor terms. All predicted protein pairs are assigned semantic similarity scores for three GO branches: molecular function (MF); biological process (BP) and cellular component (CC). Semantic scores in this study were calculated using GOSemSim package in Bioconductor (Gentleman et al., 2004). Functional linkages are predicted among all the proteins in yeast and human proteomes by all the competing methods (binary profiles, occurrence profiles, and EPT). The mean GO semantic score is calculated over all the protein pairs in the top predictions to present the biological features of the predicted pairs. The method that shows higher mean GO-SS at the same number of top predictions has stronger biological implications. 3 RESULTS AND DISCUSSION 3.1 Evaluation of Performance of the EPT Method at Predicting Functional Linkages Benchmark datasets are tested by the enhanced phylogenetic tree method. We compare our method with the conventional methods using binary profile with Pearson correlation (Glazko and Mushegian, 2004) (called bin ps) and occurrence profile (Ranea et al., 2007) with Pearson correlation (called occ ps) or Euclidean distance (called occ ed) (see Supplementary Methods ). The binary and occurrence profiles are converted from the corresponding EPTs as described in Figure 2A. Figure 4 shows the assessment of the EPT method for different test sets including human (see Figure 4A) and yeast (see Figure 4B) PRS and NRS of functional linkages. It is clear that the EPT method can capture part of the co-evolutionary signal of protein functional linkages. In particular, by using a threshold of 0.245 (0.546), the EPT method discovers about 20% (27%) functional linkages in human (yeast) datasets with a low level of false positive rate (around 5%) (see Figure 4 and Figure S2 in the supplementary material). For the top 1000 predictions, we obtained a precision of more than 90% both in the human and in the yeast datasets (see Insets in Figure 4). The EPT method shows a drastic improvement compared to traditional approaches using binary or occurrence profiles. In the Fig. 4. ROC curves of the EPT (blue), occ ed (green), occ ps (magenta) and bin ps (black) methods for human (A) and yeast (B) datasets. The dashed gray diagonal lines are corresponding to random predictions. The ROC curves with full scales are presented in Figure S1 in the supplementary data file. The insets of (A) and (B) are the precisions of the methods at different number of top predictions (cases with highest scores) for human and yeast datasets, respectively. The precision is the ratio of the number of true positives over the number of top predictions, TP/(TP+FP). top 1000 predictions of human datasets, the EPT method shows a precision of more than 90% while others show precisions less than 70%. For yeast datasets, the precision of the EPT method is about 90% while those of other methods are about 60%. The advance of the EPT method seems to be a direct consequence of using the EPT as an evolutionary pattern. It is clear that the similarity between two EPTs reflects the co-evolution of the pair of proteins belonging to these EPTs. Clearly, the EPT method helps to reduce FPs by other methods using binary and occurrence profiles (see Figure S3). In the top 1000 human predictions by the bin ps method, there are 343 FPs (> 34%). The number of FPs in the top 1000 predictions by the occ ed and occ ps methods are 289 (≈ 29%) and 378 (≈ 38%), respectively. Nearly the FPs by the traditional methods are out of the top 1000 predictions by the EPT method, meaning that they are likely identified as TNs by the EPT methods. For yeast datasets, most of the FPs in the top 1000 predictions by the traditional methods is also identified as TNs by the EPT method. Namely, the EPT method ranks 404 among 418 FPs (≈ 97%) by the bin ps method out of the top 1000 predictions. Our method also identified 420 /425 FPs (259/274 FPs ) by the occ ed (occ ps) method as TNs. Two examples of potential FPs by the traditional methods that are identified as TNs by the EPT methods are presented in Figure 5A for human datasets and Figure 5B for yeast datasets. The upper and lower panels of Figure 5A, respectively, show the EPTs, occurrence and binary profiles of negative elongation factor A protein (Q9H3P2, NELFA HUMAN) and syntaxin-12 protein (Q86Y82, STX12 HUMAN) which belong to different complexes (Ruepp et al., 2010). They seem to participate in different biological processes, implement different molecular functions and locate in different cellular components (Ashburner et al., 2000). iHOP shows no evidence for functional associations or physical interactions between these two proteins (Hoffmann and Valencia, 2004). NELFA HUMAN and STX12 HUMAN share the identical binary profiles and very similar occurrence profiles, making them to be in the top predictions by traditional methods. However, their EPTs are topologically different. There are about 55% of all the protein pairs that have higher EPT scores than this pair does, 4 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom Fig. 5. Examples of potential FPs by the traditional methods that are TNs by the EPT method. The EPTs, the corresponding occurrence and binary profiles of protein Q9H3P2 (upper) and Q86Y82 (below) in human (A) and P12945 and P32602 in yeast (B). The grey (red) numbers are the taxonomy identifiers of missing (present) proteins. The grey characters are the taxonomy names of the roots of the missing subtrees. The speciation, duplication and missing edges are represented by solid black, red and dashed edges. meaning that NELFA HUMAN and STX12 HUMAN are identified as unlikely to be functionally linked by the EPT method. Figure 5B shows a protein pair, N-terminal acetyltransferase A complex subunit protein (P12945, NAT1 YEAST) and Alpha-soluble NSF attachment protein (P32602, SEC17 YEAST), in yeast that is a TN by the EPT method but a FP by traditional methods using binary and occurrence profiles. These above examples indicate that the use of EPTs provides a significant increase of the precision and sensitivity for co-evolution signal detection. 3.2 Biological Feature of Predicted Protein Functional Linkages The relationship between the prediction methods and Gene Ontology (GO) attributes (Ashburner et al., 2000) is shown in Figure 6. For biological process (BP) and cellular component (CC), the average GO semantic score (GO-SS) of the top predictions by the EPT method is higher than that of the top predictions by other methods (see Figure 6B and 6C for yeast and Figure 6E and 6F for human), indicating that the EPT method is able to find functional linkages among proteins that participate in the same biological processes or locate in the same cellular components. Figure 6A and 6D show that the linkages predicted by the EPT method exist between proteins which do not perform the same molecular functions. This is as expected because typically a set of different biochemical activities (molecular functions) are required to implement biological processes. 3.3 Examples of Predicted Functional Linkages Linking Cancer to DNA Repair Complexes Table S1 and Table S2 in the suplementary material show several pairs of proteins that participate in the same biological process in Fig. 6. GO-SS for molecular function (A, D), biological process (B, E) and cellular component (C, F) of yeast and human prediction, respectively. The x axes represent the number of top predictions (cases with highest prediction scores) by the EPT method (blue), occ ed (green), occ ps (magenta) and the bin ps (black) and the y axes represent the average GO-SS of the top predictions. yeast and human. These pairs have very high EPT scores so they are predicted as likely functional linkages by the EPT methods. A number of these predicted linkages are confirmed by experimental assays. The others can be considered as “novel” linkages which might be candidates for further experimental validation. An interesting example of yeast functional linkages predicted by the EPT method is the linkage between the large (alpha) and small (beta) subunits of yeast phenylalanyl-tRNA synthetase (Genes FRS1 and FRS2, Uniprot accession numbers P15625 and P15624 , 5 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom respectively). The association among these two proteins has been found by many experimental assays using Affinity Capture-MS methods (Snitkin et al., 2006; Gavin et al., 2002, 2006; Krogan et al., 2006; Collins et al., 2007). Among 6,931,774 yeast scored protein pairs, the EPT method ranks the pair of P15625 and P15624 as the 13th , indicating that this pair is likely identified as a TP by the EPT method. However, this pair is out of top 100,000 predictions by any conventional methods using either occurrence or binary profiles (top 7% for occ ed, top 32% for occ ps and top 50% for bin ps), meaning that this protein pair is potentially identified as a FN. As an example of human predicted functional linkages by the EPT method, we predict several interaction partners for DNA-directed polymerase theta (POLQ, Uniprot accession number Q59EE4). The predicted interaction partners of POLQ include proliferating cell nuclear antigen (PCNA, P12004), replication factor C (RFC4, P35249), and structural maintenance of chromosomes protein 1A (SMC1A, Q14683). Among more than 109 scored pairs of human proteins, the pair POLQ-PCNA is ranked 403rd . These two proteins are both mapped to DNA replication process. There has not been any direct evidence by experimental assays on the relationship between proliferating cell nuclear antigen and polymerase (DNA directed) theta protein. However, there exist associations between PCNA and other DNA polymerases, namely DNA polymerase kappa protein (Haracska et al., 2002; Maga et al., 2002; Shimazaki et al., 2002), polymerase (DNA directed) delta protein Liu et al. (2003); Ohta et al. (2002); Pohler et al. (2005); Ducoux et al. (2001), and DNA polymerase eta protein (POLH, Q9Y253). POLH is specifically involved in DNA repair. DNA polymerase kappa and delta require activator 1 (alias RFC1- 5) in addition to PCNA. The association between POLQ and PCNA is a very promising target for experimental validation. The second of POLQ’s predicted interaction partners, RFC4, is part of activator complex 1 (composed of RFC1-5), which is known to bind to PCNA (Ohta et al., 2002; Cai et al., 1997). The third predicted interaction partner, SMC1A, is linked to disease. SMC1A is a central component of the cohesin complex, which apparently forms a large proteinaceous ring within which sister chromatids can be trapped after DNA replication. The cohesin complex interacts with a number of other proteins, including breast cancer associated BRCA1. Defects in SMC1A are the cause of Cornelia de Lange syndrome type 2 (CDLS2)[MIM:300590]; also known as Cornelia de Lange syndrome X-linked. CDLS is a clinically heterogeneous developmental disorder associated with malformations affecting multiple body parts. Mutated Cornelia de Lange cell lines display genomic instability and sensitivity to ionizing radiation and interstrand cross-linking agents (Apweiler et al., 2004). Now that is an interesting piece of information, because POLQ has been implicated in the repair of interstrand cross-links (Apweiler et al., 2004). So we hypothesize that SMC1A mutations manifest a radiation sensitive phenotype because of a disrupted physical association of the cohesin complex with POLQ. 4 CONCLUSIONS All the phylogeny based methods for predicting functional protein linkages are based on the common observation of the similarity between evolutionary patterns of interacting proteins. Therefore, the performance of these methods strongly depends on how the evolutionary patterns of proteins are described. In prokaryotes, protein families typically contain one copy per species (Ranea et al., 2007). This facilitates the prediction of functional linkages of protein using binary profiles. However, binary phylogenetic profiles are not evolutionarily informative enough for protein families in eukaryotic genomes, many of which contain more protein homologues per species. Occurrence profiles partly overcome the problem with multi-protein families in eukaryotes by counting the number of protein copies per species. Enhanced phylogenetic trees, which are graphical models of the evolution of proteins by speciation and duplication events, add richer information about the evolutionary history of proteins and help to overcome the problems which the conventional profiles encounter. The EPT algorithm creates groups of proteins using the reciprocal best BLAST criterion that is commonly used to detect orthologs. The orthodox way to define orthologs is a subject of hot debate (Ouzounis, 1999; Koonin, 2001); we note that any protein family classification can be used for phylogenetic profiling and perfect recognition of orthologs is not necessary for our prediction purposes. The current EPT method produces a simple clustering of proteins based on BLAST scores. The method can be improved in a number of ways to take into account the complexity of evolutionary relations including divergence, convergence, domain recombination and horizontal gene transfer events. For example, the choice of the descendant protein is somewhat arbitrary as BLAST scores do not reflect the complexity of these relations. We are going to address this problem in the future by using synteny information for complete genomes to maximize the number of syntenic protein pairs within layers of an EPT. Protein sequence profiles (e.g. HMMER (Eddy, 1998)) would be better representatives of the ancestral sequences than a single representative as in the current version. The present version of the version only models gene loss whereas inferring de novo creation or horizontal gene transfer requires post-processing of the trees. The EPT algorithm was designed so that singletons, as any other protein, are propagated to ancestral species for possible merging with clusters coming from other lineages. We define the ancestral taxon of an EPT as the smallest taxonomic node that encompasses all non-zero instances of the leaf proteins. In the future, we will test setting nodes to zero above the ancestral taxon, as well as modelling possible horizontal gene transfer events (which generate more than one origins of extant protein groups in the EPT). Like other phylogeny based methods, enhanced phylogenetic tree method is limited at inferring physical interactions of proteins but is promising at predicting functionally linked proteins. Functionally linked proteins were benchmarked against GO semantic similarity. Unlike domain-based approaches where domains, e.g. binding domains, are believed to perform the same molecular function but may occur in proteins that participate in very different biological processes, the thinking behind such a per-protein model as the EPT is that orthologs will be performing the same biological process across species. There are tools available which combine evidence - often weak and always noisy - from many different types of experimental and computational data to make integrated predictions of functional linkages (e.g. String, FunCoup). Examination of our top predictions indicated that they were often complementary to these existing tools, but had strong evidence from co-expression data. We believe that the EPT method would 6 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom be a valuable, orthogonal addition to such integrative tools. Our method significantly surpasses conventional methods in prediction performance and potentially discovers more reliable functional linkages. ACKNOWLEDGEMENT We thank the members of Bioinformatics group for discussion and Alexander Goesmann of the University of Bielefield for providing recourses for computation. Funding: Marie Curie grant (MRTN-CT-2005-019475). REFERENCES Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., and Yeh, L.-S. L. (2004). UniProt: the Universal Protein knowledgebase. Nucl. Acids Res., 32, D115–119. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., IsselTarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. Nat Genet, 25(10), 25–29. Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogeneticstatistical analyses of whole genomes. PLoS Comput Biol, 1(1), e3. Barker, D., Meade, A., and Pagel, M. (2007). Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics, 23(1), 14–20. Cai, J., Gibbs, E., Uhlmann, F., Phillips, B., Yao, N., ODonnell, M., and Hurwitz, J. (1997). A Complex Consisting of Human Replication Factor C p40, p37, and p36 Subunits Is a DNA-dependent ATPase and an Intermediate in the Assembly of the Holoenzyme. Journal of Biological Chemistry, 272(30), 18974–18981. Collins, S. R., Kemmeren, P., Zhao, X.-C., Greenblatt, J. F., Spencer, F., Holstege, F. C. P., Weissman, J. S., and Krogan, N. J. (2007). Toward a Comprehensive Atlas of the Physical Interactome of Saccharomyces cerevisiae. Molecular Cellular Proteomics, 6(3), 439–450. Dandekar, T., Snel, B., Huynen, M., and Bork, P. (1998). Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23(9), 324 – 328. Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotech, 21, 1055–1062. Ducoux, M., Urbach, S., Baldacci, G., Hbscher, U., Koundrioukoff, S., Christensen, J., and Hughes, P. (2001). Mediation of Proliferating Cell Nuclear Antigen (PCNA)dependent DNA Replication through a Conserved p21Cip1-like PCNA-binding Motif Present in the Third Subunit of Human DNA Polymerase . Journal of Biological Chemistry, 276(52), 49258–49266. Dutkowski, J. and Tiuryn, J. (2009). Phylogeny-guided interaction mapping in seven eukaryotes. BMC Bioinformatics, 10(1), 393. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755–763. Eisenberg, D., Marcotte, E. M., Xenarios, I., and Yeates, T. O. (2000). Protein function in the post-genomic era. Nature, 405, 823–826. Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86–90. Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A.-M., Cruciat, C.-M., Remor, M., Hofert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M.A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G., and SupertiFurga, G. (2002). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868), 141–147. Gavin, A.-C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L. J., Bastuck, S., Dumpelfeld, B., Edelmann, A., Heurtier, M.-A., Hoffman, V., Hoefert, C., Klein, K., Hudak, M., Michon, A.-M., Schelder, M., Schirle, M., Remor, M., Rudi, T., Hooper, S., Bauer, A., Bouwmeester, T., Casari, G., Drewes, G., Neubauer, G., Rick, J. M., Kuster, B., Bork, P., Russell, R. B., and SupertiFurga, G. (2006). Proteome survey reveals modularity of the yeast cell machinery. Nature, 440(7084), 631–636. Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J., and Zhang, J. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10), R80. Glazko, G. and Mushegian, A. (2004). Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol, 5(5), R32. Haracska, L., Unk, I., Johnson, R. E., Phillips, B. B., Hurwitz, J., Prakash, L., and Prakash, S. (2002). Stimulation of DNA Synthesis Activity of Human DNA Polymerase kappa by PCNA. Mol. Cell. Biol., 22(3), 784–791. Hoffmann, R. and Valencia, A. (2004). A gene network for navigating the literature. Nat Genet, 36, 664. Huynen, M., Snel, B., Lathe, W., and Bork, P. (2000). Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Res, 10(8), 1204–1210. Koonin, E. (2001). An apology for orthologs - or brave new memes. Genome Biology, 2(4), comment1005.1–comment1005.2. Krogan, N. J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu, S., Datta, N., Tikuisis, A. P., Punna, T., Peregrn-Alvarez, J. M., Shales, M., Zhang, X., Davey, M., Robinson, M. D., Paccanaro, A., Bray, J. E., Sheung, A., Beattie, B., Richards, D. P., Canadien, V., Lalev, A., Mena, F., Wong, P., Starostine, A., Canete, M. M., Vlasblom, J., Wu, S., Orsi, C., Collins, S. R., Chandran, S., Haw, R., Rilstone, J. J., Gandi, K., Thompson, N. J., Musso, G., St Onge, P., Ghanny, S., Lam, M. H. Y., Butland, G., Altaf-Ul, A. M., Kanaya, S., Shilatifard, A., O’Shea, E., Weissman, J. S., Ingles, C. J., Hughes, T. R., Parkinson, J., Gerstein, M., Wodak, S. J., Emili, A., and Greenblatt, J. F. (2006). Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature, 440(7084), 637–643. Liu, L., Rodriguez-Belmonte, E. M., Mazloum, N., Xie, B., and Lee, M. Y. W. T. (2003). Identification of a Novel Protein, PDIP38, That Interacts with the p50 Subunit of DNA Polymerase and Proliferating Cell Nuclear Antigen. Journal of Biological Chemistry, 278(12), 10041–10047. Maga, G., Villani, G., Ramadan, K., Shevelev, I., Le Gac, N. T., Blanco, L., Blanca, G., Spadari, S., and Hbscher, U. (2002). Human DNA Polymerase Functionally and Physically Interacts with Proliferating Cell Nuclear Antigen in Normal and Translesion DNA Synthesis. Journal of Biological Chemistry, 277(50), 48434–48440. Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., and Eisenberg, D. (1999). Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 285(5428), 751–753. Marcotte, E. M., Xenarios, I., van der Bliek, A. M., and Eisenberg, D. (2000). Localizing proteins in the cell from their phylogenetic profiles. Proc Natl Acad Sci U S A, 97(22), 12115–12120. Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G., Munsterkotter, M., Pagel, P., Strack, N., Stumpflen, V., Warfsmann, J., and Ruepp, A. (2004). MIPS: analysis and annotation of proteins from whole genomes. Nucl. Acids Res., 32, D41–44. O’Brien, K. P., Remm, M., and Sonnhammer, E. L. L. (2005). Inparanoid: a comprehensive database of eukaryotic orthologs. Nucl. Acids Res., 33, D476–480. Ohta, S., Shiomi, Y., Sugimoto, K., Obuse, C., and Tsurimoto, T. (2002). A Proteomics Approach to Identify Proliferating Cell Nuclear Antigen (PCNA)-binding Proteins in Human Cell Lysates: IDENTIFICATION OF THE HUMAN CHL12/RFCs25 COMPLEX AS A NOVEL PCNA-BINDING PROTEIN. Journal of Biological Chemistry, 277(43), 40362–40367. Ouzounis, C. (1999). Orthology: another terminology muddle. Trends in Genetics, 15, 445. Overbeek, R., Fonstein, M., DSouza, M., Pusch, G. D., and Maltsev, N. (1999). The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences of the United States of America, 96(6), 2896–2901. Pazos, F. and Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng, 14(9), 609–614. Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proceedings of the National Academy of Sciences of the United States of America, 96(8), 4285–4288. Pohler, J. R., Otterlei, M., and Warbrick, E. (2005). An in vivo analysis of the localisation and interactions of human p66 dna polymerase delta subunit. BMC 7 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom Molecular Biology, 6(1), 17. Ranea, J. A. G., Yeats, C., Grant, A., and Orengo, C. A. (2007). Predicting protein function with hierarchical phylogenetic profiles: The gene3d phylo-tuner method applied to eukaryotic genomes. PLoS Comput Biol, 3(11), e237. Remm, M., Storm, C. E., and Sonnhammer, E. L. (2001). Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology, 314(5), 1041 – 1052. Ruepp, A., Waegele, B., Lechner, M., Brauner, B., Dunger-Kaltenbach, I., Fobo, G., Frishman, G., Montrone, C., and Mewes, H.-W. (2010). CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucl. Acids Res., 38, D497–501. Shimazaki, N., Yoshida, K., Kobayashi, T., Toji, S., Tamai, K., and Koiwai, O. (2002). Over-expression of human DNA polymerase lambda in E. coli and characterization of the recombinant enzyme. Genes to Cells, 7(7), 639–651. Snitkin, E., Gustafson, A., Mellor, J., Wu, J., and DeLisi, C. (2006). Comparative assessment of performance and genome dependence among phylogenetic profiling methods. BMC Bioinformatics, 7(1), 420. Sprinzak, E. and Margalit, H. (2001). Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology, 311(4), 681 – 692. Ta, H. X. and Holm, L. (2009). Evaluation of different domain-based methods in protein interaction prediction. Biochemical and Biophysical Research Communications, 390(3), 357 – 362. Vert, J.-P. (2002). A tree kernel to analyse phylogenetic profiles. Bioinformatics, 18, S276–284. Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S., and Chen, C.-F. (2007). A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10), 1274–1281. Wu, J., Kasif, S., and DeLisi, C. (2003). Identification of functional links between genes using phylogenetic profiles. Bioinformatics, 19(12), 1524–1530. Zmasek, C. and Eddy, S. (2002). Rio: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3(1), 14. 8 atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom