[12:40 30/7/2011 Bioinformatics-btr396.tex] Page: 2368 2368–2375 BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 17 2011, pages 2368–2375 doi:10.1093/bioinformatics/btr396 Structural bioinformatics Advance Access publication June 30, 2011 Computational reconstruction of primordial prototypes of elementary functional loops in modern proteins Alexander Goncearenco1,2 and Igor N. Berezovsky1,∗ 1Computational Biology Unit, Uni Research and 2 Department of Informatics, University of Bergen, N-5008 Bergen, Norway Associate Editor: Anna Tramontano ABSTRACT Motivation: Enzymes are complex catalytic machines, which perform sequences of elementary chemical transformations resulting in biochemical function. The building blocks of enzymes, elementary functional loops (EFLs), possess distinct functional signatures and provide catalytic and binding amino acids to the enzyme’s active sites. The goal of this work is to obtain primordial prototypes of EFLs that existed before the formation of enzymatic domains and served as their building blocks. Results: We developed a computational strategy for reconstructing ancient prototypes of EFLs based on the comparison of sequence segments on the proteomic scale, which goes beyond detection of conserved functional motifs in homologous proteins. We illustrate the procedure by a CxxC-containing prototype with a very basic and ancient elementary function of metal/metal-containing cofactor binding and redox activity. Acquiring the prototypes of EFLs is necessary for revealing how the original set of protein folds with enzymatic functions emerged in predomain evolution. Supplementary Information: Supplementary data are available at Bioinformatics online. Contact: igor.berezovsky@uni.no Received on April 1, 2011; revised on June 7, 2011; accepted on June 26, 2011 1 INTRODUCTION Current evolution of proteins takes place via mutation and recombination of protein domains, leading to both divergence and convergence. It is rather obvious, however, that evolution of protein structure and function did not start from full protein folds (Iwasaki, 2010; Koonin, 2003), and the latter has to have been formed in a predomain stage of evolution. Therefore, one of the most important tasks in studies of the evolution of protein function is to find how the first set of folds with the most basic biochemical functions was formed and to determine the set of ancient functional peptides that gave rise to the above repertoire of folds (Berezovsky et al., 2003a, b; Trifonov et al., 2001). Previous studies have demonstrated that closed loops of 25–30 amino acid residues are universal building blocks of protein folds (Berezovsky, 2003; Berezovsky and Trifonov, 2001; Berezovsky et al., 2000). According to the same studies, the closed loops ∗To whom correspondence should be addressed. emerged from ring-like peptides, primitive proteins or proteinlike molecules in prebiotic evolution, and served as building blocks of the first protein folds. Similarly to folds that are built from closed loops as structural units, the enzymatic functions can also be decomposed into combinations of elementary units of protein function. The latter are closed loops that possess one or a few functional residues and bring them to the active site (de Gennes, 1990), which we call Elementary Functional Loops (EFLs; Goncearenco and Berezovsky, 2010). The functional signatures of some EFLs are shared between (super)families of proteins with different biochemical functions and even between different folds. Their prototypes presumably underwent recombination at the predomain stage of protein evolution. Recently, we have developed a computational procedure for finding sequence profiles of widely spread EFLs with characteristic functional signatures (Goncearenco and Berezovsky, 2010). In this procedure, we obtained sequence profiles from complete proteomes. Many of the profiles correspond to their originating protein families [such as those described in PFAM (Bateman et al., 2004) or CDD (Marchler-Bauer et al., 2005)] and represent family-specific functional signatures [as in Prosite, (Sigrist et al., 2010)], while others reveal connections between different folds and functions. Here, we would like to proceed further and find a way to obtain ancient prototypes of EFLs. Therefore, we developed a procedure for reconstructing prototypes, which served as basic units of the first folds/domains with enzymatic functions. The procedure for obtaining prototypes is very different to that of ancestor reconstruction. The latter typically requires a phylogenetic tree built from the alignment of related (super)family members and an evolutionary model with particular mutation and amino acid substitution rates (Cai et al., 2004; Harms and Thornton, 2010; Mirkin et al., 2003). Here, on the contrary, we work with short sequence fragments belonging to phylogenetically unrelated proteins from remote (super)families, which presumably were involved in the design of the first enzymes in predomain evolution (Trifonov et al., 2001). We illustrate the computational strategy that we have developed by reconstructing an ancient prototype with redox-active/metalbinding elementary function. The most generic signature of this prototype reads Cys-Xaa-Xaa-Cys (-CxxC- for brevity). We start with a minimal set of non-redundant profiles from which we reconstruct the prototype and its derivatives. Using these profiles in a profile-sequence search over complete proteomes and the protein databank, we detect a diversity of EFLs with metal/metal-containing cofactor binding and redox activity in different folds and show that they take part in different biochemical functions. 2368 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com atMasarykUniversityonSeptember16,2011bioinformatics.oxfordjournals.orgDownloadedfrom [12:40 30/7/2011 Bioinformatics-btr396.tex] Page: 2369 2368–2375 Reconstruction of prototypes of elementary functional loops 2 METHODS Procedure for obtaining and charactering profiles of EFLs: we introduce here a computational procedure for reconstructing the most ancient and generic prototypes of EFLs. Previous studies were mostly focused on detecting conserved functional motifs among homologous proteins with known functions. The distinctive feature of our method is that no preliminary assumptions about the homology or function are made, and prototypes are reconstructed from the comparison of all sequence segments on a proteomic scale. First, we cut several proteomes into initial 50-residue long segments (size of closed loop 25–30 residues plus 10-residue flanks) (Berezovsky et al., 2000), iteratively compare them to non-redundant proteomic sequences, and thus obtain sequence profiles represented as position-specific scoring matrices (PSSM) [for details see (Goncearenco and Berezovsky, 2010)]. Some of the obtained sequence profiles have clear functional signatures and correspond to EFLs from various non-homologous proteins. We reconstruct prototypes from profiles that have similar functional signatures using clustering procedure described below. In order to characterize the profiles and their elementary functions we identified the corresponding EFLs in enzymatic domains with known biochemical function and structure by performing a profile-sequence search in the CDD database (Marchler-Bauer et al., 2005). In some cases it was possible to annotate the elementary function of individual residues in a profile’s signature by analyzing the CDD annotation on the domain level, in other cases we used databases such as IBIS (Shoemaker et al., 2010), MACiE (Holliday et al., 2009), Prosite (Sigrist et al., 2010) and PDBeMotif (Golovin and Henrick, 2008). Sequence profile clustering procedure: pairwise profile comparison and clustering have O(N2) complexity and memory requirements (Murtagh, 1984). Thus, processing of 104 −105 sequence profiles becomes unfeasible without a specifically developed method. We implemented a parallel program for pairwise comparison and UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering (Sokal, 1958) of the initial set of profiles. Based on the properties of the distance matrix we introduced a heuristic, which allowed speeding the computation up significantly. In particular, identification of disconnected components in the distance matrix (above a certain profile–profile distance threshold) splits a large unclustered dataset into several smaller sets that can be clustered independently. The program source code in C using MPI is available on request. Overlap-based profile–profile comparison: profiles a and b with sets of matches A and B overlap if they match the same protein (same Gene Id) and positions of the matches are not further than 20 residues apart from each other. The distance between profiles is calculated using the Jaccard distance (Lipkus, 1999) J(a,b)=1−|A∩B|/|A∪B|, assuming that the intersection is the number of overlapping matches, and the union is the total number of unique matches in both profiles. If J =0, then all the matches of profiles overlap; if J =1, then profiles do not have any overlapping matches. PSSM-based comparison of profiles: the distance between two profiles is calculated by comparing their PSSMs. In order to account for possible profile–profile alignments, we calculate superpositions of PSSMs with maximal offset ±20 and compare the corresponding 30-residue windows. The distance between the windows is calculated as the Euclidean distance between aligned PSSM positions weighted by the total information gain (Kullback–Leibler divergence, DKL) of amino acid frequencies in the aligned positions relative to the average proteomic frequencies of amino acids (Kullback and Leibler, 1951). From all possible superpositions of two profiles, the one giving the minimal distance between 3-residue windows a and b is used. The pairwise distance reads: d = i DKL(ai)+DKL(bi) ai −bi . Benchmark preparation: sequences of 40 archaeal proteomes (Supplementary Table ST1) were compared using BLASTP (Altschul et al., 1990) (E-value ≤0.0001) against a non-redundant (up to 40% pairwise sequence identity) SCOP/ASTRAL database of structural domains, release 1.75 (Murzin et al., 1995) in order to identify structural domains in the proteomes. Homologous domains were identified in 40% of the sequences (48 269 domains in 36 718 sequences). Redundancy between the identified domains has been removed with cd-hit program (Li and Godzik, 2006) down to 70% sequence identity. The benchmark required a set of superfamilies for which the coverage could be robustly measured. Therefore, we selected 150 SCOP superfamilies (TOP150) that were represented by more than 50 non-redundant domains in our dataset. The fold census of the TOP150 set is shown in Supplementary Figure S1. 3 RESULTS The goal of this work is to obtain prototypes of elementary functions and their derivatives. We define a prototype as the most generic and ancient functional signature, and its derivatives as descendants (EFLs) with a diversity of contemporary functions. We develop a special benchmark for sequence search with profiles of EFLs, introduce and evaluate a new PSSM-based measure for profileprofile comparison, and reconstruct prototypes by clustering the profiles. As a case study, we explore a prototype with -CxxCcontaining signature and its derivatives. 3.1 Sequence profile search benchmark The functional diversity of a prototype can be estimated by the number of SCOP superfamilies found by its associated profiles of EFLs (assuming that the superfamilies themselves are unrelated). Since it is important to select out random hits from real matches to superfamilies we need a way to estimate the sensitivity (detection power) of a profile in the sequence search. A true match to a superfamily can be asserted by a sufficient coverage of the superfamily by sequence search. As in any sequence–sequence or profile–sequence search, we resort to the E-value parameter to control the number of expected false positives. We need to determine the minimal E-value, which gives the maximal number of true positive hits. Already existing benchmarks can only be used for evaluating the sequence–sequence and sequence–profile search on the level of whole proteins or domains, and not on the level of short sequence fragments. Therefore, we were prompted here to create the TOP150 benchmark, which we use to show that E-value calculated in our procedure is a correct estimate of the expected number of false positives. Figure 1A shows that our search procedure provides a clear separation between true and false hits by a plateau phase as function of E-value. The plateau widens with increasing coverage from 0% (black line) to 70% (green line), resulting in only true hits in superfamilies with high coverage (above 50%, yellow line). As a result, by controlling the coverage we obtain the number of ‘true’superfamilies corresponding to the prototype. This level is indicated by an arrow in Figure 1A, showing where the red, blue and yellow plateaus are aligned. In the routine search, where coverage is not controlled (black line), the minimal E-value can be determined as the level at which the number of covered superfamilies reaches an already determined ‘true superfamily plateau’. Correspondingly, the end of the plateau on the black line designates the maximal E-value. The interval between the plateau on the black line and the one on red/blue lines (few hits) determines the false positive rate expected in the search with no control of coverage. At lower E-values (E <0.0005) several hits will be missed as ‘false negatives’. At higher E-values (E >1) false positive rate will increase exponentially (see the black line in the figure). The TOP150 benchmark helps to determine the optimal 2369 atMasarykUniversityonSeptember16,2011bioinformatics.oxfordjournals.orgDownloadedfrom [12:40 30/7/2011 Bioinformatics-btr396.tex] Page: 2370 2368–2375 A.Goncearenco and I.N.Berezovsky Fig. 1. Benchmark for profile-sequence search and profile–profile comparison. (A) Dependency between the E-value of profile-sequence search and the number of TOP150 superfamilies found with a certain level of coverage (color lines). The plateau phase observed at high coverage (shown with an arrow) indicates the number of ‘true positive’ superfamilies. The corresponding range of E-values (0.0005