A Novel Method for Assigning Functional Linkages to
Proteins Using Enhanced Phylogenetic Trees
Hung Xuan Ta 1∗, Patrik Koskinen 2 and Liisa Holm 1,2
1
Institute of Biotechnology, PO Box 56, 00014 University of Helsinki, Finland
2
Division of Genetics, Faculty of Biological Sciences, PO Box 56, 00014 University of Helsinki,
Finland
ABSTRACT
Motivation: Functional linkages implicate pairwise relationships
between proteins that work together to implement biological tasks.
During evolution, functionally linked proteins are likely to be preserved
or eliminated across a range of genomes in a correlated fashion.
Based on this hypothesis, phylogenetic proﬁling based approaches
try to detect pairs of protein families that show similar evolutionary
patterns. Traditionally, the evolutionary pattern of a protein is encoded
by either a binary proﬁle of presence and absence of this protein
across species or an occurrence proﬁle that indicates the distribution
of copies of this protein across species.
Results: In our study, we characterize each protein by its enhanced
phylogenetic tree, a novel graphical model of the evolution of a
protein family with explicitly marked by speciation and duplication
events. By topological comparison between enhanced phylogenetic
trees, we are able to detect the functionally associated protein pairs.
Because the enhanced phylogenetic trees contain more evolutionary
information of proteins, our method shows greater performance and
discovers functional linkages among proteins more reliably compared
to the conventional approaches.
Contact: xuanhung.ta@helsinki.ﬁ, liisa.holm@helsinki.ﬁ
Sumplementary information: Sumplementary data are available at
Bioinformatics online
1 INTRODUCTION
A great number of cellular behaviors are mediated by proteins
which always carry out their functions by interacting with each
other (Eisenberg et al., 2000). Proteins that participate in a common
structural complex or pathway or biological process are deﬁned
as functionally linked. Deﬁning functional linkages of proteins
is one of the central goals in proteomics that will decipher the
molecular mechanisms underlying the biological functions and,
then, help to enhance approaches for drug discovery. The complete
sequencing of multiple genomes from diverse species provides
an excellent opportunity to develop comparative approaches for
functional studies in proteomics. Biological interactions and
functional linkages of proteins can be inferred via various patterns
across many genomes. These include the co-localization of genes
on chromosomes (Dandekar et al., 1998; Overbeek et al., 1999), the
∗to whom correspondence should be addressed
genetical fusion of two distinct proteins from one organism into a
single protein in another organism (Enright et al., 1999; Marcotte
et al., 1999), protein domains composition (Sprinzak and Margalit,
2001; Ta and Holm, 2009) and phylogenetic proﬁles (Pellegrini
et al., 1999; Marcotte et al., 2000; Wu et al., 2003; Dutkowski and
Tiuryn, 2009).
Phylogeny-based methods for protein functional relationship
prediction are broadly applicable due to a sufﬁciently large number
of completely sequenced genomes. These methods are premised
on the hypothesis that functionally linked proteins evolve in a
correlated fashion, and therefore they have homologs in the same set
of organisms (Pellegrini et al., 1999). The original approach uses the
simplest form of phylogenetic proﬁle of a protein, a binary string in
which each bit indicates the presence or absence of a protein family
in a different species (Pellegrini et al., 1999). This basic method,
often referred to “phylogenetic binary proﬁling”, subsequently
searches for matching pairs of proﬁles which differ from each other
by less than three bits. In further steps, proﬁles can be compared by
statistical correlation measures, such as co-occurrence probability
(Wu et al., 2003), the Pearson correlation coefﬁcient (Glazko and
Mushegian, 2004), Fisher’s exact test (Barker and Pagel, 2005) or
mutual information (Huynen et al., 2000; Date and Marcotte, 2003).
Recently, Ranea et al. (2007) introduced phylogenetic occurrence
proﬁling to detect the functionally related protein families in
eukaryotic genomes. The phylogenetic occurrence proﬁle of a
protein family is a vector in which each element indicates the
number of protein members of this family observed in one organism.
Unlike prokaryotic genomes, where a high proportion of protein
families have about one copy per species, eukaryotic genomes show
a large number of multi-protein families having more than one
member per species (Ranea et al., 2007). Phylogenetic occurrence
proﬁling, therefore, is able to detect more evolutionary signals that
could not be detected by phylogenetic binary proﬁling.
However, binary and occurrence proﬁles cannot adequately
describe the whole evolutionary history of proteins. Many methods
characterize proteins by their reconstructed phylogenetic trees.
These approaches, then, compare the phylogenetic trees by utilizing
the parsimony principle (Barker et al., 2007), maximum likelihood
model (Barker and Pagel, 2005), a tree-kernel model (Vert, 2002)
or the correlation between the distances matrices used to build the
trees (Pazos and Valencia, 2001).
1
Associate Editor: Prof. Burkhard Rost
© The Author (2010). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Bioinformatics Advance Access published December 17, 2010
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
Evolutionary divergence, convergence and horizontal transfer
events in multi-protein families are more complex than can be
described by conventional phylogenetic proﬁles. In this work, we
introduce a novel graphical model of the evolution of proteins
which enchances the information content of phylogenetic proﬁles by
accounting for the reconstructed proteomes of ancestral species and
synchronous gene duplication events. Whereas orthologs evolved
by vertical descent (speciation) from a single protein in the last
common ancestor of the compared genomes, paralogs evolved by
gene duplication. In this work, one-to-one correspondences between
orthologs are detected using the reciprocal best hits criterion, and
duplications are identiﬁed using the InParanoid criterion (O’Brien
et al., 2005). This so-called enhanced phylogenetic tree (EPT)
explicitly traces all the descendants of proteins from the last
universal common ancestor (LUCA) down to the extant species.
This tree is different from the tree reconciliation (Zmasek and
Eddy, 2002) in that protein data is directly ﬁtter on the species
tree, and speciation or duplication events are inferred using an
exclusion principle to ensure strict one-to-one correspondences
between proteins derived by speciation. The EPTs, subsequently,
are topologically compared to ﬁnd the evolutionarily correlated
proteins families. Two proteins of a target organism belonging to
two correlated families are likely to be functionally linked.
In this paper, we benchmark the EPT method by using positive
and negative reference sets of functional linkages in Homo
sapiens and Saccharomyces cerevisiae. The biological features
of the predicted functional linkages are examined by studying
GO annotations in the biological process, cellular component,
and molecular function branches of Gene Ontology. Our novel
method shows a signiﬁcant improvement compared to conventional
phylogenetic proﬁling methods and makes predictions that are
complementary to other prediction methods.
2 METHODS
2.1 Benchmark Datasets
Benchmark sets including positive and negative reference sets (PRS
and NRS, respectively) are needed to calculate the Receiver Operating
Characteristic (ROC), which represents the performance of a classiﬁcation
method. By true positive (TP), we mean an interaction or linkage that is
predicted by our method and present in the PRS. Analogously, a true negative
(TN) is a pair of proteins predicted by our method not to interact that is
present in the NRS. A false positive (FP) is a protein pair in the NRS that is
predicted to interact by our method while a false negative (FN) is a protein
pair in the PRS predicted not to interact by our method. In the plot of the
ROC curve, the x-axis represents false positive rate (FPR) or 1-speciﬁcity,
that is FP/(TN+FP), and the y-axis represents true positive rate (TPR) or
sensitivity, TP/(TP+FN), as the threshold gradually varies.
To evaluate the performance of our methods at predicting protein
functional linkages, we choose the pairs of proteins existing within the same
complex as referent positives and the random pairs of proteins from different
complexes, which are not connected in the interaction graph, as referent
negatives. For yeast datasets, 5888 pairs of co-complex proteins and 5,888
randomly chosen pairs were derived from the manually curated catalogue in
MIPS (Mewes et al., 2004).
The benchmark datasets of human functional linkages are derived from
the CORUM database, a resource of manually annotated protein complexes
from mammalian organisms (Ruepp et al., 2010). After ﬁltering out cocomplex
protein pairs discovered by high-throughput experimental methods,
we obtained 26,813 pairs of protein for the human PRS. The human NRS
contains 26,813 pairs of proteins chosen randomly from different complexes.
2.2 Protein Families
Phylogenetic proﬁling requires that proteins from different organisms are
classiﬁed into families which can be compared across species. We clustered
the proteins in the proteomes of 572 complete genomes (560 prokaryotic and
12 eukaryotic organisms) based on their BLAST (Basic Local Alignment
Search Tool) similarity. This study considered similarities above a bitscore of
50, which effectively ﬁlters false positive hits from the sequence comparison
(Remm et al., 2001). We tested both lower and higher thresholds, which
yielded worse performance in the benchmark. The clusters have an internal
hierarchical structure that guarantees strict one-to-one correspondences
between proteins from different species in each subfamily. The clustering
algorithm is a hierarchical modiﬁcation of the InParanoid algorithm (Remm
et al., 2001). The inputs are the species tree and the protein sequences of a
set of organism. The algorithm outputs proteins classiﬁed into families with
gene duplication events mapped to taxa in the species tree. The details of the
clustering algorithm are shown in Figure 1.
There were 2,188,588 proteins in the proteomes. 254,418 singletons
(about 11,6%) had no similarity to other proteins. The other proteins were
clustered into 91,428 protein families (EPTs) having more than one member.
The average number of leaf proteins in an EPT was 21.2 approximately. The
distribution of EPT size is scale free, i.e. it follows a power law.
2.3 Enhanced Phylogenetic Trees
The strength of functional linkage between two protein families is assessed
by comparing the topologies of the corresponding EPTs.
Figure 2 shows two examples of EPT, tree A and tree B. In tree A,
four round-dot lines represent the proteomes of four organisms. Tree A
(B) has four (two) subtrees f1, f2, f3, f4 (f1, f2), forming in three (two)
layers. Each subtree, representing a subfamily, is composed of the same
color leaves, internal node and solid edges (speciation). They are separated
by dashed edges (duplication). The corresponding occurrence and binary
proﬁles of the family are shown below. Intuitively, the EPT contains more
evolutionary signals than conventional proﬁling. The NCBI taxonomy tree
is used in constructing EPTs. In a subtree, the dashed edges (empty nodes)
indicate the edges (nodes) of the NCBI taxonomy tree that do not exist in
the subtree. In tree A, the ﬁrst layer contains the subtree f1, the second layer
contains f2 and f3, and the third layer contains f4.
2.4 Subtree similarity
The topological similarity of two subtrees is computed guided by the NCBI
taxonomy tree, meaning that the comparison is based on the taxonomy
identiﬁers of nodes in two subtrees (see Figure 3). Let TA
i be a part of
subtree A that roots from node i and C(i, A) denote the set of children
of node i in subtree A. The similarity of two subtrees A and B rooting from
node i, which is denoted by S(TA
i , TB
i ) , can be calculated as follows
S(TA
i , TB
i ) = α +
k∈C(i,A)∩C(i,B)
S(TA
k , TB
k ) −
β |(C(i, A) − C(i, B)) ∩ (C(i, B) − C(i, A))| (1)
where X − Y denotes the set of objects that belong to X and not to
Y , X ∩ Y denotes the set of objects that belong to both set X and set Y
and |X| means the number of elements in set X. In Equation (1), α(β) is
the reward (penalty) coefﬁcient. From Equation (1) we have S(TA
i , TB
j ) =
−β if j = i or one of two trees is null. In words, each common node
yields a reward α and each deletion of a subtree costs a penalty β. In the
current implementation, we require that two subtrees have the same root
node identiﬁers, that is, only subtrees that were created by gene duplications
at the same taxonomic node contribute to the EPT score (the tree in the ﬁrst
layer is always rooted at LUCA).
2
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
Fig. 1. Schematic example to show how the family deﬁnition algorithm
works. Phase I: The proteomes of ancestral species are reconstructed
hierarchically based on the species tree. Here, species ABC is reconstructed
from child species A, B and C. Proteins are clustered based on all versus
all Blast hits between the child species. The list of Blast hits is sorted
by similarity score in decreasing order. Initially, each child protein has an
image in the ancestor but the actual parent is yet unassigned. Child proteins
which derive from the same ancestral protein are identiﬁed based on Blast
hits with the highest similarity score. Only one protein from each child
species may be assigned to the same ancestral protein. Recent duplications
(generating in-paralogs within a child species) have higher similarity scores
than BLAST hits to a sister species. Singleton proteins have no similarity
to sister species; they are assumed to be present in the ancestor. Finally,
the algorithm removes image nodes from the ancestor which are without
progeny. Phase II: Here, the root node (LUCA) is reconstructed from ABC
and D. The similarity scores from reconstructed proteins are estimated as the
maximum of the similarities of the cluster members; for example, ABC2-D1
is the maximum of A1-D1, B1-D1 and C1-D1. The ﬁnal assignments after
the reconstruction are shown. Following the edges from the leaf proteins to
reconstructed LUCA proteins gives the protein families and lineages shown
in Outputs.
2.5 Measuring the Similarity of Enhanced Phylogenetic
Trees
Two EPTs are compared layer-by-layer (see an example in Figure 2). Let
TA
ijk be subtree k in the ith layer of EPT A whose root taxonomy identiﬁer
is j. Let tA
i be the set of the root taxonomy identiﬁers of the subtrees in
the ith layer of EPT A. The number of layers of EPT A is denoted by A.
Fig. 2. Two examples of EPTs and their decompositions. Trees A and
B are compared layer-by-layer. The missing layer in tree B is penalized
(correspoding to the third term of Equation (2)). Subtrees (f1, f1) in the
1st layer and (f2, f2) in the 2nd layer are compared (correspoding to the
ﬁrst term of Equation (2)). The missing subtree in the second layer of tree B
is penalized (correspoding to the second term of Equation (2)).
The similarity of two EPTs A and B, with an assumption that A > B, is
calculated as follow
SAB =
B
i=1
wi



j∈tA
i ∩tB
i



1
nA
ijnB
ij
nA
ij
k=1
nB
ij
l=1
S(TA
ijk, TB
ijl)



−β



j∈(tA
i −tB
i )
nA
ij +
j∈(tB
i −tA
i )
nB
ij






+
A
i= B+1
wi(−βnA
i ) (2)
where S(T1, T2) notiﬁes the similarity between two subtrees T1, T2
which is computed by Equation (1). In Equation (2, nA
ij = {TA
ijk}∀k
denotes the number of subtrees whose root taxonomy identiﬁer is j in the
ith layer of EPT A and nA
i = j∈tA
i
nA
ij is the number of all the subtrees
in the ith layer of EPT A. Therefore, the ﬁrst term compares subtrees
having the same root taxonomy identiﬁers, the second term is the penalty
for subtrees in one layer of an EPT and not in the corresponding layer of the
other EPT and the third term is the penalty for missing layers. In this study,
we set wi = 1/i, meaning that the layers are assigned weights depending
on their lineage. Finally the scores are normalized as follows
SAB =
SAB − min{SAB}A,B
max{SAB}A,B − min{SAB}A,B
. (3)
2.6 Predicting Functional Linkages of Proteins
A protein is characterized by the EPT of the family to which it belongs. Two
similar EPTs mean correlated patterns of evolution and, by implication, a
functional linkage. We construct EPTs for all the families that contain more
than one member and at least one of the members is a protein of the target
organism. Two proteins whose EPTs have a higher similarity score than a
given threshold are classiﬁed as functionally linked or physically interacting
proteins.
3
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
Fig. 3. The topological comparison of two trees is based on the taxonomy
identiﬁers of the nodes in two trees. The numbers labeling the nodes in the
trees are taxonomy identiﬁers. The similarity is recursively calculated by
Equation 1. In this example, there are ﬁve common nodes (black) and three
subtree deletions (dashed subtrees).
2.7 Calculating GO Semantic Similarity
To investigate the biological features of predicted functionally linked protein
pairs, we quantify the functional similarity for these pairs by computing the
semantic similarity (SS) score (Wang et al., 2007) of the GO terms with
which these proteins are annotated. The semantic score takes into account
both the level of hierarchy of GO terms and also their relations with their
ancestor terms. All predicted protein pairs are assigned semantic similarity
scores for three GO branches: molecular function (MF); biological process
(BP) and cellular component (CC). Semantic scores in this study were
calculated using GOSemSim package in Bioconductor (Gentleman et al.,
2004).
Functional linkages are predicted among all the proteins in yeast and
human proteomes by all the competing methods (binary proﬁles, occurrence
proﬁles, and EPT). The mean GO semantic score is calculated over all the
protein pairs in the top predictions to present the biological features of the
predicted pairs. The method that shows higher mean GO-SS at the same
number of top predictions has stronger biological implications.
3 RESULTS AND DISCUSSION
3.1 Evaluation of Performance of the EPT Method at
Predicting Functional Linkages
Benchmark datasets are tested by the enhanced phylogenetic
tree method. We compare our method with the conventional
methods using binary proﬁle with Pearson correlation (Glazko and
Mushegian, 2004) (called bin ps) and occurrence proﬁle (Ranea
et al., 2007) with Pearson correlation (called occ ps) or Euclidean
distance (called occ ed) (see Supplementary Methods ). The binary
and occurrence proﬁles are converted from the corresponding EPTs
as described in Figure 2A. Figure 4 shows the assessment of the EPT
method for different test sets including human (see Figure 4A) and
yeast (see Figure 4B) PRS and NRS of functional linkages. It is clear
that the EPT method can capture part of the co-evolutionary signal
of protein functional linkages. In particular, by using a threshold
of 0.245 (0.546), the EPT method discovers about 20% (27%)
functional linkages in human (yeast) datasets with a low level of
false positive rate (around 5%) (see Figure 4 and Figure S2 in the
supplementary material). For the top 1000 predictions, we obtained
a precision of more than 90% both in the human and in the yeast
datasets (see Insets in Figure 4).
The EPT method shows a drastic improvement compared to
traditional approaches using binary or occurrence proﬁles. In the
Fig. 4. ROC curves of the EPT (blue), occ ed (green), occ ps (magenta)
and bin ps (black) methods for human (A) and yeast (B) datasets. The
dashed gray diagonal lines are corresponding to random predictions. The
ROC curves with full scales are presented in Figure S1 in the supplementary
data ﬁle. The insets of (A) and (B) are the precisions of the methods at
different number of top predictions (cases with highest scores) for human
and yeast datasets, respectively. The precision is the ratio of the number of
true positives over the number of top predictions, TP/(TP+FP).
top 1000 predictions of human datasets, the EPT method shows a
precision of more than 90% while others show precisions less than
70%. For yeast datasets, the precision of the EPT method is about
90% while those of other methods are about 60%. The advance of
the EPT method seems to be a direct consequence of using the EPT
as an evolutionary pattern. It is clear that the similarity between two
EPTs reﬂects the co-evolution of the pair of proteins belonging to
these EPTs.
Clearly, the EPT method helps to reduce FPs by other methods
using binary and occurrence proﬁles (see Figure S3). In the top
1000 human predictions by the bin ps method, there are 343 FPs
(> 34%). The number of FPs in the top 1000 predictions by the
occ ed and occ ps methods are 289 (≈ 29%) and 378 (≈ 38%),
respectively. Nearly the FPs by the traditional methods are out of
the top 1000 predictions by the EPT method, meaning that they are
likely identiﬁed as TNs by the EPT methods. For yeast datasets,
most of the FPs in the top 1000 predictions by the traditional
methods is also identiﬁed as TNs by the EPT method. Namely,
the EPT method ranks 404 among 418 FPs (≈ 97%) by the bin ps
method out of the top 1000 predictions. Our method also identiﬁed
420 /425 FPs (259/274 FPs ) by the occ ed (occ ps) method as TNs.
Two examples of potential FPs by the traditional methods that
are identiﬁed as TNs by the EPT methods are presented in
Figure 5A for human datasets and Figure 5B for yeast datasets.
The upper and lower panels of Figure 5A, respectively, show
the EPTs, occurrence and binary proﬁles of negative elongation
factor A protein (Q9H3P2, NELFA HUMAN) and syntaxin-12
protein (Q86Y82, STX12 HUMAN) which belong to different
complexes (Ruepp et al., 2010). They seem to participate in different
biological processes, implement different molecular functions and
locate in different cellular components (Ashburner et al., 2000).
iHOP shows no evidence for functional associations or physical
interactions between these two proteins (Hoffmann and Valencia,
2004). NELFA HUMAN and STX12 HUMAN share the identical
binary proﬁles and very similar occurrence proﬁles, making them
to be in the top predictions by traditional methods. However, their
EPTs are topologically different. There are about 55% of all the
protein pairs that have higher EPT scores than this pair does,
4
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
Fig. 5. Examples of potential FPs by the traditional methods that are TNs by the EPT method. The EPTs, the corresponding occurrence and binary proﬁles
of protein Q9H3P2 (upper) and Q86Y82 (below) in human (A) and P12945 and P32602 in yeast (B). The grey (red) numbers are the taxonomy identiﬁers of
missing (present) proteins. The grey characters are the taxonomy names of the roots of the missing subtrees. The speciation, duplication and missing edges are
represented by solid black, red and dashed edges.
meaning that NELFA HUMAN and STX12 HUMAN are identiﬁed
as unlikely to be functionally linked by the EPT method. Figure 5B
shows a protein pair, N-terminal acetyltransferase A complex
subunit protein (P12945, NAT1 YEAST) and Alpha-soluble NSF
attachment protein (P32602, SEC17 YEAST), in yeast that is a TN
by the EPT method but a FP by traditional methods using binary and
occurrence proﬁles. These above examples indicate that the use of
EPTs provides a signiﬁcant increase of the precision and sensitivity
for co-evolution signal detection.
3.2 Biological Feature of Predicted Protein Functional
Linkages
The relationship between the prediction methods and Gene
Ontology (GO) attributes (Ashburner et al., 2000) is shown in
Figure 6. For biological process (BP) and cellular component (CC),
the average GO semantic score (GO-SS) of the top predictions
by the EPT method is higher than that of the top predictions by
other methods (see Figure 6B and 6C for yeast and Figure 6E
and 6F for human), indicating that the EPT method is able to
ﬁnd functional linkages among proteins that participate in the same
biological processes or locate in the same cellular components.
Figure 6A and 6D show that the linkages predicted by the EPT
method exist between proteins which do not perform the same
molecular functions. This is as expected because typically a set of
different biochemical activities (molecular functions) are required
to implement biological processes.
3.3 Examples of Predicted Functional Linkages Linking
Cancer to DNA Repair Complexes
Table S1 and Table S2 in the suplementary material show several
pairs of proteins that participate in the same biological process in
Fig. 6. GO-SS for molecular function (A, D), biological process (B, E) and
cellular component (C, F) of yeast and human prediction, respectively. The
x axes represent the number of top predictions (cases with highest prediction
scores) by the EPT method (blue), occ ed (green), occ ps (magenta) and
the bin ps (black) and the y axes represent the average GO-SS of the top
predictions.
yeast and human. These pairs have very high EPT scores so they
are predicted as likely functional linkages by the EPT methods. A
number of these predicted linkages are conﬁrmed by experimental
assays. The others can be considered as “novel” linkages which
might be candidates for further experimental validation.
An interesting example of yeast functional linkages predicted
by the EPT method is the linkage between the large (alpha) and
small (beta) subunits of yeast phenylalanyl-tRNA synthetase (Genes
FRS1 and FRS2, Uniprot accession numbers P15625 and P15624 ,
5
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
respectively). The association among these two proteins has been
found by many experimental assays using Afﬁnity Capture-MS
methods (Snitkin et al., 2006; Gavin et al., 2002, 2006; Krogan
et al., 2006; Collins et al., 2007). Among 6,931,774 yeast scored
protein pairs, the EPT method ranks the pair of P15625 and P15624
as the 13th
, indicating that this pair is likely identiﬁed as a TP by the
EPT method. However, this pair is out of top 100,000 predictions by
any conventional methods using either occurrence or binary proﬁles
(top 7% for occ ed, top 32% for occ ps and top 50% for bin ps),
meaning that this protein pair is potentially identiﬁed as a FN.
As an example of human predicted functional linkages by the EPT
method, we predict several interaction partners for DNA-directed
polymerase theta (POLQ, Uniprot accession number Q59EE4).
The predicted interaction partners of POLQ include proliferating
cell nuclear antigen (PCNA, P12004), replication factor C (RFC4,
P35249), and structural maintenance of chromosomes protein
1A (SMC1A, Q14683). Among more than 109
scored pairs of
human proteins, the pair POLQ-PCNA is ranked 403rd
. These
two proteins are both mapped to DNA replication process. There
has not been any direct evidence by experimental assays on
the relationship between proliferating cell nuclear antigen and
polymerase (DNA directed) theta protein. However, there exist
associations between PCNA and other DNA polymerases, namely
DNA polymerase kappa protein (Haracska et al., 2002; Maga
et al., 2002; Shimazaki et al., 2002), polymerase (DNA directed)
delta protein Liu et al. (2003); Ohta et al. (2002); Pohler et al.
(2005); Ducoux et al. (2001), and DNA polymerase eta protein
(POLH, Q9Y253). POLH is speciﬁcally involved in DNA repair.
DNA polymerase kappa and delta require activator 1 (alias RFC1-
5) in addition to PCNA. The association between POLQ and
PCNA is a very promising target for experimental validation. The
second of POLQ’s predicted interaction partners, RFC4, is part
of activator complex 1 (composed of RFC1-5), which is known
to bind to PCNA (Ohta et al., 2002; Cai et al., 1997). The
third predicted interaction partner, SMC1A, is linked to disease.
SMC1A is a central component of the cohesin complex, which
apparently forms a large proteinaceous ring within which sister
chromatids can be trapped after DNA replication. The cohesin
complex interacts with a number of other proteins, including breast
cancer associated BRCA1. Defects in SMC1A are the cause of
Cornelia de Lange syndrome type 2 (CDLS2)[MIM:300590]; also
known as Cornelia de Lange syndrome X-linked. CDLS is a
clinically heterogeneous developmental disorder associated with
malformations affecting multiple body parts. Mutated Cornelia
de Lange cell lines display genomic instability and sensitivity to
ionizing radiation and interstrand cross-linking agents (Apweiler
et al., 2004). Now that is an interesting piece of information,
because POLQ has been implicated in the repair of interstrand
cross-links (Apweiler et al., 2004). So we hypothesize that SMC1A
mutations manifest a radiation sensitive phenotype because of a
disrupted physical association of the cohesin complex with POLQ.
4 CONCLUSIONS
All the phylogeny based methods for predicting functional protein
linkages are based on the common observation of the similarity
between evolutionary patterns of interacting proteins. Therefore,
the performance of these methods strongly depends on how the
evolutionary patterns of proteins are described.
In prokaryotes, protein families typically contain one copy per
species (Ranea et al., 2007). This facilitates the prediction of
functional linkages of protein using binary proﬁles. However, binary
phylogenetic proﬁles are not evolutionarily informative enough for
protein families in eukaryotic genomes, many of which contain
more protein homologues per species. Occurrence proﬁles partly
overcome the problem with multi-protein families in eukaryotes
by counting the number of protein copies per species. Enhanced
phylogenetic trees, which are graphical models of the evolution
of proteins by speciation and duplication events, add richer
information about the evolutionary history of proteins and help to
overcome the problems which the conventional proﬁles encounter.
The EPT algorithm creates groups of proteins using the reciprocal
best BLAST criterion that is commonly used to detect orthologs.
The orthodox way to deﬁne orthologs is a subject of hot debate
(Ouzounis, 1999; Koonin, 2001); we note that any protein family
classiﬁcation can be used for phylogenetic proﬁling and perfect
recognition of orthologs is not necessary for our prediction
purposes.
The current EPT method produces a simple clustering of proteins
based on BLAST scores. The method can be improved in a
number of ways to take into account the complexity of evolutionary
relations including divergence, convergence, domain recombination
and horizontal gene transfer events. For example, the choice of the
descendant protein is somewhat arbitrary as BLAST scores do not
reﬂect the complexity of these relations. We are going to address this
problem in the future by using synteny information for complete
genomes to maximize the number of syntenic protein pairs within
layers of an EPT. Protein sequence proﬁles (e.g. HMMER (Eddy,
1998)) would be better representatives of the ancestral sequences
than a single representative as in the current version. The present
version of the version only models gene loss whereas inferring de
novo creation or horizontal gene transfer requires post-processing
of the trees. The EPT algorithm was designed so that singletons, as
any other protein, are propagated to ancestral species for possible
merging with clusters coming from other lineages. We deﬁne the
ancestral taxon of an EPT as the smallest taxonomic node that
encompasses all non-zero instances of the leaf proteins. In the
future, we will test setting nodes to zero above the ancestral taxon,
as well as modelling possible horizontal gene transfer events (which
generate more than one origins of extant protein groups in the EPT).
Like other phylogeny based methods, enhanced phylogenetic tree
method is limited at inferring physical interactions of proteins but
is promising at predicting functionally linked proteins. Functionally
linked proteins were benchmarked against GO semantic similarity.
Unlike domain-based approaches where domains, e.g. binding
domains, are believed to perform the same molecular function
but may occur in proteins that participate in very different
biological processes, the thinking behind such a per-protein
model as the EPT is that orthologs will be performing the same
biological process across species. There are tools available which
combine evidence - often weak and always noisy - from many
different types of experimental and computational data to make
integrated predictions of functional linkages (e.g. String, FunCoup).
Examination of our top predictions indicated that they were often
complementary to these existing tools, but had strong evidence
from co-expression data. We believe that the EPT method would
6
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
be a valuable, orthogonal addition to such integrative tools. Our
method signiﬁcantly surpasses conventional methods in prediction
performance and potentially discovers more reliable functional
linkages.
ACKNOWLEDGEMENT
We thank the members of Bioinformatics group for discussion and
Alexander Goesmann of the University of Bieleﬁeld for providing
recourses for computation.
Funding: Marie Curie grant (MRTN-CT-2005-019475).
REFERENCES
Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S.,
Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A.,
O’Donovan, C., Redaschi, N., and Yeh, L.-S. L. (2004). UniProt: the Universal
Protein knowledgebase. Nucl. Acids Res., 32, D115–119.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis,
A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., IsselTarver,
L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M.,
Rubin, G. M., and Sherlock, G. (2000). Gene ontology: tool for the uniﬁcation of
biology. Nat Genet, 25(10), 25–29.
Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogeneticstatistical
analyses of whole genomes. PLoS Comput Biol, 1(1), e3.
Barker, D., Meade, A., and Pagel, M. (2007). Constrained models of evolution lead
to improved prediction of functional linkage from correlated gain and loss of genes.
Bioinformatics, 23(1), 14–20.
Cai, J., Gibbs, E., Uhlmann, F., Phillips, B., Yao, N., ODonnell, M., and Hurwitz, J.
(1997). A Complex Consisting of Human Replication Factor C p40, p37, and p36
Subunits Is a DNA-dependent ATPase and an Intermediate in the Assembly of the
Holoenzyme. Journal of Biological Chemistry, 272(30), 18974–18981.
Collins, S. R., Kemmeren, P., Zhao, X.-C., Greenblatt, J. F., Spencer, F., Holstege,
F. C. P., Weissman, J. S., and Krogan, N. J. (2007). Toward a Comprehensive
Atlas of the Physical Interactome of Saccharomyces cerevisiae. Molecular Cellular
Proteomics, 6(3), 439–450.
Dandekar, T., Snel, B., Huynen, M., and Bork, P. (1998). Conservation of gene order:
a ﬁngerprint of proteins that physically interact. Trends in Biochemical Sciences,
23(9), 324 – 328.
Date, S. V. and Marcotte, E. M. (2003). Discovery of uncharacterized cellular systems
by genome-wide analysis of functional linkages. Nat Biotech, 21, 1055–1062.
Ducoux, M., Urbach, S., Baldacci, G., Hbscher, U., Koundrioukoff, S., Christensen, J.,
and Hughes, P. (2001). Mediation of Proliferating Cell Nuclear Antigen (PCNA)dependent
DNA Replication through a Conserved p21Cip1-like PCNA-binding
Motif Present in the Third Subunit of Human DNA Polymerase . Journal of
Biological Chemistry, 276(52), 49258–49266.
Dutkowski, J. and Tiuryn, J. (2009). Phylogeny-guided interaction mapping in seven
eukaryotes. BMC Bioinformatics, 10(1), 393.
Eddy, S. R. (1998). Proﬁle hidden Markov models. Bioinformatics, 14(9), 755–763.
Eisenberg, D., Marcotte, E. M., Xenarios, I., and Yeates, T. O. (2000). Protein function
in the post-genomic era. Nature, 405, 823–826.
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., and Ouzounis, C. A. (1999). Protein
interaction maps for complete genomes based on gene fusion events. Nature, 402,
86–90.
Gavin, A.-C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz,
J., Rick, J. M., Michon, A.-M., Cruciat, C.-M., Remor, M., Hofert, C., Schelder,
M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D.,
Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M.A.,
Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M.,
Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G., and SupertiFurga,
G. (2002). Functional organization of the yeast proteome by systematic
analysis of protein complexes. Nature, 415(6868), 141–147.
Gavin, A.-C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C.,
Jensen, L. J., Bastuck, S., Dumpelfeld, B., Edelmann, A., Heurtier, M.-A., Hoffman,
V., Hoefert, C., Klein, K., Hudak, M., Michon, A.-M., Schelder, M., Schirle, M.,
Remor, M., Rudi, T., Hooper, S., Bauer, A., Bouwmeester, T., Casari, G., Drewes,
G., Neubauer, G., Rick, J. M., Kuster, B., Bork, P., Russell, R. B., and SupertiFurga,
G. (2006). Proteome survey reveals modularity of the yeast cell machinery.
Nature, 440(7084), 631–636.
Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B.,
Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S.,
Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A., Sawitzki, G., Smith, C.,
Smyth, G., Tierney, L., Yang, J., and Zhang, J. (2004). Bioconductor: open software
development for computational biology and bioinformatics. Genome Biology, 5(10),
R80.
Glazko, G. and Mushegian, A. (2004). Detection of evolutionarily stable fragments of
cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol, 5(5),
R32.
Haracska, L., Unk, I., Johnson, R. E., Phillips, B. B., Hurwitz, J., Prakash, L.,
and Prakash, S. (2002). Stimulation of DNA Synthesis Activity of Human DNA
Polymerase kappa by PCNA. Mol. Cell. Biol., 22(3), 784–791.
Hoffmann, R. and Valencia, A. (2004). A gene network for navigating the literature.
Nat Genet, 36, 664.
Huynen, M., Snel, B., Lathe, W., and Bork, P. (2000). Predicting Protein Function by
Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Res,
10(8), 1204–1210.
Koonin, E. (2001). An apology for orthologs - or brave new memes. Genome Biology,
2(4), comment1005.1–comment1005.2.
Krogan, N. J., Cagney, G., Yu, H., Zhong, G., Guo, X., Ignatchenko, A., Li, J., Pu,
S., Datta, N., Tikuisis, A. P., Punna, T., Peregrn-Alvarez, J. M., Shales, M., Zhang,
X., Davey, M., Robinson, M. D., Paccanaro, A., Bray, J. E., Sheung, A., Beattie,
B., Richards, D. P., Canadien, V., Lalev, A., Mena, F., Wong, P., Starostine, A.,
Canete, M. M., Vlasblom, J., Wu, S., Orsi, C., Collins, S. R., Chandran, S., Haw,
R., Rilstone, J. J., Gandi, K., Thompson, N. J., Musso, G., St Onge, P., Ghanny, S.,
Lam, M. H. Y., Butland, G., Altaf-Ul, A. M., Kanaya, S., Shilatifard, A., O’Shea,
E., Weissman, J. S., Ingles, C. J., Hughes, T. R., Parkinson, J., Gerstein, M., Wodak,
S. J., Emili, A., and Greenblatt, J. F. (2006). Global landscape of protein complexes
in the yeast saccharomyces cerevisiae. Nature, 440(7084), 637–643.
Liu, L., Rodriguez-Belmonte, E. M., Mazloum, N., Xie, B., and Lee, M. Y. W. T.
(2003). Identiﬁcation of a Novel Protein, PDIP38, That Interacts with the p50
Subunit of DNA Polymerase and Proliferating Cell Nuclear Antigen. Journal of
Biological Chemistry, 278(12), 10041–10047.
Maga, G., Villani, G., Ramadan, K., Shevelev, I., Le Gac, N. T., Blanco, L.,
Blanca, G., Spadari, S., and Hbscher, U. (2002). Human DNA Polymerase
Functionally and Physically Interacts with Proliferating Cell Nuclear Antigen in
Normal and Translesion DNA Synthesis. Journal of Biological Chemistry, 277(50),
48434–48440.
Marcotte, E. M., Pellegrini, M., Ng, H.-L., Rice, D. W., Yeates, T. O., and Eisenberg, D.
(1999). Detecting Protein Function and Protein-Protein Interactions from Genome
Sequences. Science, 285(5428), 751–753.
Marcotte, E. M., Xenarios, I., van der Bliek, A. M., and Eisenberg, D. (2000).
Localizing proteins in the cell from their phylogenetic proﬁles. Proc Natl Acad
Sci U S A, 97(22), 12115–12120.
Mewes, H. W., Amid, C., Arnold, R., Frishman, D., Guldener, U., Mannhaupt, G.,
Munsterkotter, M., Pagel, P., Strack, N., Stumpﬂen, V., Warfsmann, J., and Ruepp,
A. (2004). MIPS: analysis and annotation of proteins from whole genomes. Nucl.
Acids Res., 32, D41–44.
O’Brien, K. P., Remm, M., and Sonnhammer, E. L. L. (2005). Inparanoid: a
comprehensive database of eukaryotic orthologs. Nucl. Acids Res., 33, D476–480.
Ohta, S., Shiomi, Y., Sugimoto, K., Obuse, C., and Tsurimoto, T. (2002). A Proteomics
Approach to Identify Proliferating Cell Nuclear Antigen (PCNA)-binding Proteins
in Human Cell Lysates: IDENTIFICATION OF THE HUMAN CHL12/RFCs25
COMPLEX AS A NOVEL PCNA-BINDING PROTEIN. Journal of Biological
Chemistry, 277(43), 40362–40367.
Ouzounis, C. (1999). Orthology: another terminology muddle. Trends in Genetics, 15,
445.
Overbeek, R., Fonstein, M., DSouza, M., Pusch, G. D., and Maltsev, N. (1999). The use
of gene clusters to infer functional coupling. Proceedings of the National Academy
of Sciences of the United States of America, 96(6), 2896–2901.
Pazos, F. and Valencia, A. (2001). Similarity of phylogenetic trees as indicator of
protein-protein interaction. Protein Eng, 14(9), 609–614.
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D., and Yeates, T. O.
(1999). Assigning protein functions by comparative genome analysis: Protein
phylogenetic proﬁles. Proceedings of the National Academy of Sciences of the
United States of America, 96(8), 4285–4288.
Pohler, J. R., Otterlei, M., and Warbrick, E. (2005). An in vivo analysis of the
localisation and interactions of human p66 dna polymerase delta subunit. BMC
7
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom
Molecular Biology, 6(1), 17.
Ranea, J. A. G., Yeats, C., Grant, A., and Orengo, C. A. (2007). Predicting protein
function with hierarchical phylogenetic proﬁles: The gene3d phylo-tuner method
applied to eukaryotic genomes. PLoS Comput Biol, 3(11), e237.
Remm, M., Storm, C. E., and Sonnhammer, E. L. (2001). Automatic clustering of
orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular
Biology, 314(5), 1041 – 1052.
Ruepp, A., Waegele, B., Lechner, M., Brauner, B., Dunger-Kaltenbach, I., Fobo,
G., Frishman, G., Montrone, C., and Mewes, H.-W. (2010). CORUM: the
comprehensive resource of mammalian protein complexes–2009. Nucl. Acids Res.,
38, D497–501.
Shimazaki, N., Yoshida, K., Kobayashi, T., Toji, S., Tamai, K., and Koiwai, O. (2002).
Over-expression of human DNA polymerase lambda in E. coli and characterization
of the recombinant enzyme. Genes to Cells, 7(7), 639–651.
Snitkin, E., Gustafson, A., Mellor, J., Wu, J., and DeLisi, C. (2006). Comparative
assessment of performance and genome dependence among phylogenetic proﬁling
methods. BMC Bioinformatics, 7(1), 420.
Sprinzak, E. and Margalit, H. (2001). Correlated sequence-signatures as markers of
protein-protein interaction. Journal of Molecular Biology, 311(4), 681 – 692.
Ta, H. X. and Holm, L. (2009). Evaluation of different domain-based methods in protein
interaction prediction. Biochemical and Biophysical Research Communications,
390(3), 357 – 362.
Vert, J.-P. (2002). A tree kernel to analyse phylogenetic proﬁles. Bioinformatics, 18,
S276–284.
Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S., and Chen, C.-F. (2007). A new method to
measure the semantic similarity of GO terms. Bioinformatics, 23(10), 1274–1281.
Wu, J., Kasif, S., and DeLisi, C. (2003). Identiﬁcation of functional links between
genes using phylogenetic proﬁles. Bioinformatics, 19(12), 1524–1530.
Zmasek, C. and Eddy, S. (2002). Rio: Analyzing proteomes by automated
phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3(1),
14.
8
atMasarykUniversityonOctober4,2011bioinformatics.oxfordjournals.orgDownloadedfrom