REVIEW

Machine learning techniques for protein function prediction

Rosalin Bonetta1 | Gianluca Valentino2

1 Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
2 Department of Communications and Computer Engineering, University of Malta, Msida, Malta

Correspondence: Rosalin Bonetta, Centre for Molecular Medicine and Biobanking, University of Malta, Msida MSD2080, Malta. Email: rosalin.bonetta@um.edu.mt

Peer Review: The peer review history for this article is available at https://publons.com/publon/10.1002/prot.25832.

Abstract
Proteins play important roles in living organisms, and their function is directly linked to their structure. Due to the growing gap between the number of proteins being discovered and the number that are functionally characterized (in particular as a result of experimental limitations), reliable prediction of protein function through computational means has become crucial. This paper reviews the machine learning techniques used in the literature, following their evolution from simple algorithms such as logistic regression to more advanced methods like support vector machines and modern deep neural networks. Hyperparameter optimization methods adopted to boost prediction performance are presented. In parallel, the metamorphosis in the features used by these algorithms, from classical physicochemical properties and amino acid composition up to text-derived features from biomedical literature and learned feature representations using autoencoders, together with feature selection and dimensionality reduction techniques, is also reviewed. The success stories in the application of these techniques to both general and specific protein function prediction are discussed.

KEYWORDS: deep learning, feature selection, machine learning, protein function prediction

1 | INTRODUCTION

Proteins are made up of 20 different types of amino acids, which occur in nature and are encoded by DNA sequences. Proteins perform essential roles in the cells of organisms. These include cell signaling, regulation, recognition, catalysis of reactions, membrane transport, and the provision of structure. The function performed by a protein depends on its structure, which is, indirectly, a result of its DNA sequence. A classical view of protein function focuses on the action of a single protein molecule, for example, the catalysis of a given reaction or the binding of a molecule, which may be small or large. Today this local function is occasionally termed the "molecular function" of the protein, so as to distinguish it from the expanded view of function (Figure 1). In the expanded view of protein function, a protein is defined as an element in the network of its interactions. Numerous terms, such as "contextual function" or "cellular function," have been coined for this expanded view of function.2 The idea conveyed is that each protein plays a role in an extended network of interacting molecules. Therefore, a function can be thought of as "anything that happens to or through a protein".3 The extent to which a protein's function is altered upon mutating an amino acid depends on the type and position of the amino acid that is mutated, for example, whether the amino acid is found in an enzyme active site. Thus, numerous mutations may affect protein function in a complicated manner and are, therefore, difficult to predict. Due to limitations imposed by experimental methods,4 predicting protein function by computational means has become crucial.
Protein functions can be described at different levels of complexity, which include the cellular, biochemical, physiological, and phenotypic levels. In addition, protein function may be defined in a hierarchical manner. For instance, at a high level, superoxide dismutase is an oxidoreductase, while at a lower level, it converts superoxide radicals into hydrogen peroxide and molecular oxygen. Gene Ontology (GO) terms offer an accurate description of the several levels of protein function.5 It is vital to comprehend that the molecular or biochemical function of a protein is demonstrated via sequence and/or structural data. Therefore, in silico approaches can aid in the prediction of protein function.6 As discussed by Lee et al, there are different interdependent levels of protein function, which may be divided into three major types of GO categories: molecular function, biological process, and cellular component (Figure 2).7 Molecular function refers to activity at the molecular level (eg, catalysis), and is commonly predicted through computational methods that identify homologues or orthologues. Biological process describes broader functions, which are performed by assemblies of molecular functions, such as a particular metabolic pathway. Genomic inference methods can identify the direct physical protein-protein interactions and indirect functional associations found in biological processes. Finally, cellular component describes the location(s) within a cell in which the protein performs its function. Prediction of protein subcellular localization is an important component of bioinformatics-based prediction of protein function and genome annotation, as it can aid the identification of drug targets.8 This component can be predicted through methods that predict signal sequences, residue composition, membrane association, or posttranslational modifications. Protein information is stored in several databases, such as UniProt,9 the leading protein sequence database, or Pfam, a database of protein families for which the protein sequence is known but the function is unknown.10 The gap between the number of protein sequences and the number of functional annotations has been growing continuously (Figure 3). There are an order of magnitude more protein sequences in the UniProt Knowledgebase (UniProtKB) today than 10 years ago. However, the number of manually annotated and reviewed protein sequences (UniProtKB/SwissProt) has only marginally increased. Therefore, a main challenge in bioinformatics involves predicting the role played by proteins in biological processes and disease, as well as predicting the mechanisms by which such functions are performed. As new algorithms are developed to address these questions, it is essential to evaluate the performance of these different function prediction algorithms with respect to more traditional, manual methods. The bioinformatics community has sought to address the problem of automated protein function prediction through initiatives such as the Critical Assessment of Function Annotation (CAFA) challenge.11 This is an experiment designed to provide large-scale assessment of computational methods used to predict protein function.
For more than a decade, researchers have used machine learning techniques to derive sequence-function relationships. Machine learning models of protein function have been shown to provide good predictive performance, even when the underlying mechanisms were not well understood. Bernardes et al documented the growing critical mass of literature in which machine learning techniques were used to predict protein function in their review paper.12 However, following the trend in other domains, besides the use of established methods like random forests, support vector machines (SVMs), and neural networks, the use of deep learning has also caught on, with impressive results. Deep learning is well suited to big data problems, and is now within reach due to the rapid evolution in computational performance. Therefore, we extend the review of the literature beyond the one performed by Bernardes et al in 2013 to include novel sources of features and deep learning approaches, among others. Other reviews have focused on specific taxonomies and ontologies, such as enzyme functional class prediction13 and subcellular localization,14 whereas this review is intended to be more comprehensive, covering a wide array of features and techniques which may be interchangeable across different taxonomies. The notion of protein function and a recapitulation of the existing techniques used for function prediction were already provided in this introduction. The next part of this review presents protein function prediction as a problem which can be targeted using machine learning techniques. These techniques range from the generation and selection of suitable features, to algorithms and models which can be trained to perform this task. In addition, the applications of these techniques to general and specific function prediction are also discussed. This review concludes with the future perspectives for these techniques in this domain.

FIGURE 1 The evolution of the meaning of protein function. The traditional view is illustrated on the left, and the post-genomic view on the right. Adapted from Reference 1

FIGURE 2 Classification of protein function according to GO: molecular function, biological process, and cellular component

FIGURE 3 Number of sequences deposited and experimentally validated in UniProtKB over the past decade. The drop observed between 2015 and 2016 is due to procedures deployed by curators to identify and remove redundant proteomes

2 | MACHINE LEARNING TECHNIQUES FOR PROTEIN FUNCTION PREDICTION

2.1 | Feature engineering and representation

The inputs to a predictive model, which is trained using machine learning techniques pertinent to a particular object, in this case a protein, are known as features. A key step in applying machine learning to any application is identifying suitable features. This can allow the model to discriminate between one category of data and another in a classification problem, or to fit a suitable function to some data in a regression problem. Generating suitable features is also known as feature engineering. A group of features representing one particular object is known as a feature vector, while the n-dimensional space associated with the feature vector is termed the feature space. Typical protein features include amino acid sequences, physicochemical properties, and protein-protein interactions.
Amino acid sequences can be used to derive parameters such as amino acid composition, which refers to the occurrence of amino acids in a particular sequence; amino acid transition, which represents the frequency with which specific amino acid types are followed or preceded by other amino acid types within the sequence; and amino acid distribution, which captures the dissemination of specific amino acid types within specific portions of the sequence. A particular category of sequence-based features is the sequence motif, which consists of a widespread amino acid sequence pattern that is thought to have a certain biological significance. Therefore, the presence or absence of a particular sequence motif can be used as a binary feature. N-terminal targeting sequences have also been used as features.15,16 Sequence-related features such as Auto Covariance, Conjoint Triad, local descriptors, and Moran autocorrelation have proved useful in mining interaction information in the sequence.17 Physicochemical properties of protein residues include isoelectric points, molecular weights, polarity, hydrophobicity, normalized van der Waals volume, extinction coefficients, polarizability, charge, and surface tension. Protein-protein interaction (PPI) networks are mathematical representations of the physical contacts between proteins. The linkage-based assumption,18 also known as the guilt-by-association rule, comes from the observation that immediate neighbor proteins and level-2 neighbors have a high probability of sharing functions. Therefore, a protein's function can be inferred from the majority of its neighbors' functions. In addition to considering neighboring proteins, it is also common to consider the weights of the interactions, which are proportional to the reliability of the experimental sources. PPI tools such as Cytoscape19 provide access to further network features, such as average shortest path length, neighborhood connectivity, radiality, and the topological coefficient. Features can also be generated based on the overall Composition, Transition and Distribution (CTD) of amino acid attributes such as physicochemical properties, secondary structure, and solvent accessibility.20 This feature vector was used to classify protein locations in the cellular sorting pathway. After introducing the basic sources, we now discuss how features can be better represented. The concept of protein granularity and the possibility of extracting features from it was originally proposed in Reference 21. Protein granularity captures information about sequence-order effects and amino acid composition. As machine learning algorithms can only handle vectors, the Pseudo Amino Acid Composition (PseAAC)22 was developed to formulate an amino acid sequence of arbitrary length as a vector. A protein sequence of length L with amino acid residues R1R2R3…RL, where R1 represents the residue at sequence position 1, R2 represents the residue at position 2, and so on, may be denoted as a (20 + λ)-dimensional vector defined by 20 + λ discrete numbers, that is,

X = [x1, …, x20, x20+1, …, x20+λ]    (1)

The first 20 numbers represent the classic amino acid composition, while the next λ discrete numbers reflect the effect of sequence order.
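To make this concrete, the following is a minimal Python sketch of a PseAAC-style (20 + λ)-dimensional vector, not the full formulation of Reference 22: it uses a single physicochemical property (Kyte-Doolittle hydrophobicity) as a simplified stand-in for the sequence-order correlation terms, whereas the original PseAAC combines several properties.

```python
# Minimal sketch of a PseAAC-style (20 + lambda)-dimensional feature vector.
# The sequence-order correlation used here (squared hydrophobicity difference)
# is a simplified stand-in for the full set of physicochemical terms in the
# original PseAAC formulation.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydrophobicity values (one property; PseAAC combines several)
HYDRO = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
         "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
         "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
         "W": -0.9, "Y": -1.3}

def pseaac(sequence, lam=5, weight=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector."""
    counts = Counter(sequence)
    L = len(sequence)
    # First 20 components: classic amino acid composition
    composition = [counts.get(aa, 0) / L for aa in AMINO_ACIDS]
    # Next lam components: sequence-order correlation factors theta_1..theta_lam
    thetas = []
    for k in range(1, lam + 1):
        theta = sum((HYDRO[sequence[i]] - HYDRO[sequence[i + k]]) ** 2
                    for i in range(L - k)) / (L - k)
        thetas.append(theta)
    # Normalize so that all 20 + lam components sum to 1
    norm = sum(composition) + weight * sum(thetas)
    return [c / norm for c in composition] + [weight * t / norm for t in thetas]

print(pseaac("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```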
The position-specific scoring matrix (PSSM) was first introduced for detecting distantly related proteins.23 The original PSSM introduced by Gribskov et al consists of the following components: (a) position: indicates the sequentially increasing index of each amino acid residue in a sequence after multiple sequence alignment; (b) probe: a group of typical sequences of functionally related proteins that have been aligned by sequence or structural similarity; (c) profile: a matrix consisting of 20 columns corresponding to the 20 amino acids; (d) consensus: a sequence of amino acid residues that is closest to all of the alignment residues of the probes at each position. It is generated by selecting the highest score in the profile at each position. Therefore, a PSSM for a given protein consists of an N × 20 matrix, where N is the length of the protein sequence. It assigns a score Pij to the jth amino acid in the ith position of the query sequence, with a large value indicating a highly conserved position and a small value indicating a weakly conserved position. However, as machine learning algorithms typically require a fixed input size, the PSSMs need to be processed further. A systematic study of three different feature sets extracted using PSSMs was performed by Jeong et al.24 The first feature set consisted of the PSSM profiles averaged over blocks, each covering 5% of a sequence. A protein sequence, regardless of length, is thus divided into 20 blocks, and each block contributes 20 features derived from the 20 columns of the PSSM. In the second feature set, instead of considering the locations of domains in a sequence, the authors focused on the domains with similar conservation rates. In the third feature set, the physicochemical properties of probed residues in the original protein sequences were considered. A total of nine physicochemical properties were categorized into two groups: an average group and a density group. Hydrophobicity, isoelectric point, and mass were averaged, while hydrophobic, hydrophilic, polar, nonpolar, positively charged, and negatively charged residues were used for calculating densities. Following training using machine learning models such as SVMs, random forests, and decision trees, the second feature set was found to be the most effective in protein function prediction. In Reference 25, the authors used protein granularity as one of the input features. Machine learning algorithms generally require numerical features in order to develop a suitable model. While this is straightforward for most sequence-, physicochemical-, and PPI-derived features, it is also possible to use text-based features if these are converted into a numerical format. Advances in Natural Language Processing (NLP) techniques have resulted in a greater exploitation of text-based features for protein function prediction from biomedical literature, such as abstracts or full texts of journal articles.26 NLP techniques are also well suited due to the nature of data storage in biological and biochemical databases.27 These techniques were previously used to represent amino acid sequences as text, and to extract features such as n-grams and term frequency-inverse document frequency (TFIDF). An n-gram is a contiguous sequence of n items from a sequential dataset, such as a protein sequence. In TFIDF, each document is represented by a vector of all terms in a controlled vocabulary. For each term in a document, a weight is calculated as the product of the TF and IDF, where TF is its frequency in this document, and IDF is its inverse document frequency in the full dataset of documents. The basic idea of TFIDF is to emphasize the terms with more occurrences in a document and fewer occurrences (making them more discriminative) in the document dataset.
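As an illustration of n-gram and TFIDF features over amino acid sequences, the following is a short sketch using scikit-learn's TfidfVectorizer; the sequences and parameters are illustrative rather than drawn from any of the cited works.

```python
# Sketch: treating amino acid sequences as text and extracting n-gram TFIDF
# features with scikit-learn. The sequences here are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer

sequences = ["MKTAYIAKQRQISFVK", "MKLVINGKTLKGEITV", "GSHMKTAYIAKQRQIS"]

# analyzer="char" with ngram_range=(3, 3) produces overlapping trigrams of
# amino acids, each weighted by its TF-IDF score.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences)

print(X.shape)                               # (3, number of distinct trigrams)
print(vectorizer.get_feature_names_out()[:10])
```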
Another representation is document-to-vector, which is a dense, semantic representation for documents.28 In NLP, text features are represented using vectors and techniques such as Word2Vec.29 Asgari and Mofrad describe how they developed ProtVec to represent amino acid sequences.30 Text mining applied to the bioinformatics literature has been shown to be particularly useful in extracting protein-protein interactions and the relationships between gene functions and diseases. In the specific case of cellular component prediction, immunohistochemistry images have also been used as features. From September 2018 to January 2019, a Kaggle competition31 was organized by the Human Protein Atlas to bring together computer scientists and biologists to identify protein locations from these images. When using traditional machine learning algorithms and models, feature generation typically needs to be guided to some extent by domain experts. However, deep learning algorithms have been shown to be capable of extracting the relevant and salient features from a given input. Therefore, in this case the feature generation is said to be data-driven. A typical example of data-driven feature generation is a neural network autoencoder, which attempts to learn its own inputs. In this case, the features are extracted from the output of the neurons in the middle of the network, and can then be used to train other classifiers. A pictorial example of this architecture is shown in a figure in Reference 32. Some work has already been done on using autoencoders as feature generators for protein function prediction.32,33 A summary of typical features used to represent proteins for functional classification is shown in Table 1.
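As a sketch of such data-driven feature generation, the following Keras snippet trains a small autoencoder on placeholder data and reads features off the bottleneck layer; the layer sizes and data are illustrative, not those of the cited works.

```python
# Sketch of data-driven feature generation with a neural network autoencoder
# (Keras). The network learns to reproduce its own inputs; the bottleneck
# layer's activations then serve as a compact feature representation for a
# downstream classifier.
import numpy as np
from tensorflow import keras

n_features, n_latent = 400, 32           # e.g., 400 raw features -> 32 learned
X = np.random.rand(1000, n_features)     # placeholder for real protein features

inputs = keras.Input(shape=(n_features,))
encoded = keras.layers.Dense(128, activation="relu")(inputs)
bottleneck = keras.layers.Dense(n_latent, activation="relu")(encoded)
decoded = keras.layers.Dense(128, activation="relu")(bottleneck)
outputs = keras.layers.Dense(n_features, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)  # learns own inputs

# Encoder sub-model: features are read off the middle (bottleneck) layer
encoder = keras.Model(inputs, bottleneck)
learned_features = encoder.predict(X)    # can then be fed to, e.g., an SVM
print(learned_features.shape)            # (1000, 32)
```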
2.2 | Feature selection

In several application domains, such as biology, the use of machine learning techniques suffers from the curse of dimensionality: the feature space is so large that the available data become sparse, and in turn, performance degrades. Therefore, this wealth of information needs to be filtered to obtain a final set of features suitable for the problem at hand. This step is known as dimensionality reduction. Feature selection, in which subsets of the original set of features are kept, is a special case of dimensionality reduction. Naively, one would test each possible subset of features and select the subset which minimizes the error with respect to the ground truth. However, this brute-force approach is computationally feasible only for small feature sets.

Molina et al hold that Feature Selection Algorithms (FSAs) can be characterized as a search problem in the hypothesis space (ie, the space of candidate feature subsets) in terms of three aspects: the search strategy, which is the general strategy with which the space of hypotheses is explored; the generation of successor candidates, the mechanism by which possible variants of the current hypothesis are proposed; and the evaluation measure, the function by which successor candidates are evaluated, allowing different hypotheses to be compared in order to guide the search process.85 A general review of feature selection in bioinformatics, with specific applications of these techniques in sequence analysis, microarray analysis, and mass spectra analysis, is available in Reference 86. On the other hand, Wang et al87 categorize feature selection algorithms for big data bioinformatics into exhaustive search, heuristic search, and hybrid methods. Feature selection algorithms are generally classified into three main categories: wrapper methods, filter methods, and embedded methods. Wrapper methods evaluate candidate feature subsets by using the same type of predictive model (eg, a random forest or support vector machine) that will be applied to the selected features later, when the final classification model is built. Each new feature subset is used to train a model, which is tested on a hold-out set to obtain an error rate. As wrapper methods train a new model for each subset, they are computationally intensive, but they tend to provide the best performing feature set for that particular model. Recursive Feature Elimination (RFE) is an example of a wrapper method: the predictive model is initially fitted with all available features, and the weakest feature is then repeatedly removed until a predetermined minimum number of features is reached. Examples of work which used RFE include References 69, 71. On the other hand, forward feature selection starts with the evaluation of each individual feature and selects the one which results in the best performing model. Then, all possible combinations of the selected feature and subsequent features are evaluated in order to select a second feature. This is iteratively repeated until a maximum number of features is reached. Forward feature selection was used in Reference 72. In Reference 41, feature ranking was performed in the WEKA tool88 using an SVM as an evaluator. Filter methods evaluate candidate feature subsets by using a proxy measure instead of the error rate obtained by the algorithm to be applied to the selected features later. This measure is chosen because it is computationally inexpensive while still capturing the usefulness of the feature set. Common examples include mutual information and the Pearson product-moment correlation coefficient. The t test and ANalysis Of VAriance (ANOVA) are two examples of univariate parametric filter methods, while the Wilcoxon rank sum is an example of a univariate model-free method. The ANOVA method was used by Tang et al to rank 400 dipeptides, which were later used to train an SVM classifier to identify growth hormone-binding proteins.89 Al-Shahib et al used a filter-based approach to select discriminatory features: for each feature, the Wilcoxon signed-rank test was performed for each comparison of functional classes.2 Features were retained if a Wilcoxon P-value < .02 was achieved for at least one comparison of classes, that is, if they contributed potentially discriminating information.
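The following scikit-learn sketch contrasts a wrapper method (RFE around a linear SVM) with a univariate ANOVA filter; the synthetic dataset and parameter choices are illustrative only.

```python
# Sketch contrasting a wrapper method (RFE with a linear SVM) and a
# univariate parametric filter (ANOVA F-test) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# Wrapper: recursively refit the model, dropping the weakest feature each round
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X, y)
print("RFE-selected feature indices:", list(rfe.get_support(indices=True)))

# Filter: rank features by ANOVA F-score, independently of any classifier
anova = SelectKBest(f_classif, k=10).fit(X, y)
print("ANOVA-selected feature indices:", list(anova.get_support(indices=True)))
```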
A filter method called FrankSum was developed specifically for protein function prediction.90 It uses a combination of the Wilcoxon rank test P-value, to measure the statistical significance of a single feature in discriminating two functional classes, and correlation coefficients, to examine redundancy between features. The Information Gain Ratio measure can also be used to rank features; this was used in References 40, 43. XGBoost, a type of gradient boosted tree algorithm, was used as a filter method in Reference 68 to select 32 GO features from an initial 21 000 features. In Reference 42, the authors evaluated the use of rough set theory, as well as Correlation Feature Selection, the Fast Correlation-Based Filter, and Artificial Immune Systems, as feature selection algorithms for classifying protein function. In Reference 91, rough sets were used to rank the top 15 features from a feature set built on the compositional percentages of the properties of the 20 amino acids. The Minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm92 is an extension of maximum relevance, in which the selected features are those that correlate most strongly with the classification variable. As biological data often contain relevant but redundant features, mRMR attempts to address this problem by removing the redundant subsets. Several works made use of mRMR43,51,76 for protein function prediction. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The classical example is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, reducing many of them to zero. Any features which have nonzero coefficients are "selected" by the LASSO algorithm. Another example of an embedded method is the random forest, which can be used to obtain feature importances. This technique was used in Reference 55 to rank protein sequence features for enzyme function classification. After feature ranking by a random forest, Lou et al performed wrapper-based feature selection using a best-first forward search strategy.57 On the other hand, techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) produce a smaller set of new synthetic features from linear combinations of the original ones. PCA was used in References 17, 25, 93-95, while multilabel LDA was used in Reference 59. Apart from reducing the dimensionality of the input features, it may also be desirable to reduce the space of possible output labels. Makrodimitris et al developed two novel Label-Space Dimensionality Reduction (LSDR) techniques to improve the CAFA performance of several function prediction algorithms.58 From an NLP point of view, non-negative matrix factorization was used in Reference 96 to transform bag-of-words input features into a new, compressed space that captures the variability of the data.
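A brief scikit-learn sketch of an embedded method (LASSO, which zeroes out regression coefficients) and of PCA, which builds new synthetic features from linear combinations of the originals rather than selecting a subset; the data are synthetic and the settings illustrative.

```python
# Sketch of an embedded method (LASSO) and of PCA-based dimensionality
# reduction on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

X, y = make_classification(n_samples=200, n_features=50, n_informative=8,
                           random_state=0)

# LASSO: the L1 penalty drives many coefficients to exactly zero;
# features with nonzero coefficients are the "selected" ones.
lasso = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("LASSO kept", len(selected), "of", X.shape[1], "features")

# PCA: 10 synthetic features, each a linear combination of the original 50
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)
print("Explained variance:", pca.explained_variance_ratio_.sum())
```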
TABLE 1 Summary of typical features used to represent proteins for functional classification

Feature: Physicochemical properties
Advantages: Simple and numeric
Disadvantages: Do not capture enough information about the protein
Usage in literature: 25,34-44

Feature: Sequence-based
Advantages: Capture plenty of information
Disadvantages: Typically require a conversion process to numeric data for machine learning
Usage in literature: 2,7,15-17,33,34,36,38,40,42-72

Feature: PPI networks
Advantages: Neighboring proteins have a high probability of sharing functions
Disadvantages: Reliability of PPI data depends on the experimental source
Usage in literature: 45,68,73-76

Feature: Biomedical text
Advantages: Provides a rich source of information which is currently under-utilized
Disadvantages: Results are strongly affected by how informative the selected terms are
Usage in literature: 16,77-82

Feature: Immunohistochemistry images
Advantages: Rich in features, easy to visualize
Disadvantages: Require more computational power and larger datasets; only useful for subcellular localization tasks
Usage in literature: 83

Feature: Representation learning
Advantages: Removes the need for manual feature engineering and selection
Disadvantages: Requires more computational power and larger datasets
Usage in literature: 32,33,45,47,64,77,84

2.3 | Machine learning algorithms and models

Machine learning techniques are used to determine the parameters of a data-driven model which translates a given input to the correct output. Protein function prediction is a classification problem, as the input needs to be mapped to a discrete output. Classifier models can be trained to perform this task using supervised, unsupervised, or semi-supervised learning. In supervised learning, a training dataset is available with a series of output labels (known as the ground truth) corresponding to the input vectors. On the other hand, in unsupervised learning no ground truth is provided. Therefore, unsupervised learning techniques are primarily concerned with finding patterns and structures (eg, clusters) in the data, which may then need to be analyzed further. Semi-supervised learning lies between the two previous learning paradigms, in that the training set typically contains a mixture of a small amount of labeled data and a large amount of unlabeled data. A large variety of machine learning algorithms and models have been developed in the past decades, and have been applied in many contexts and applications. Among the simplest of supervised learning algorithms is logistic regression, in which a sigmoid function is used as a squashing function to map a real-valued input to the range from 0 to 1. You et al trained a logistic regressor on text-based features (TFIDF and D2V) derived from the MEDLINE biomedical literature database. This was done to predict between molecular function, biological process, and cellular component.77 A kernel logistic regression model based on diffusion kernels for protein interaction networks was developed in Reference 73. The model achieved better prediction accuracy when compared to a previous model based on Markov random fields. Similarly, the authors in Reference 74 also trained a logistic regressor to predict protein function based on protein-protein interactions. Naive Bayes classifiers are a family of simple, probabilistic classifiers which apply Bayes' theorem with the strong (naive) assumption that all features are independent of each other given the class variable. In Reference 97, the authors train a Naive Bayes classifier to predict protein-protein interaction sites, while in Reference 98, the Extended Local Hierarchical Naive Bayes algorithm99 was used.
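In the spirit of the logistic regression approaches on literature-derived features described above, the following is a minimal sketch pairing TFIDF features with a logistic regressor; the toy abstracts and labels are placeholders, not the MEDLINE data used in the cited work.

```python
# Minimal sketch of text-based function prediction: TFIDF features from
# (toy) abstracts feeding a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

abstracts = ["catalyzes the dismutation of superoxide radicals",
             "component of the mitochondrial outer membrane",
             "binds DNA and regulates transcription initiation",
             "localizes to the nucleus during interphase"]
labels = ["molecular_function", "cellular_component",
          "molecular_function", "cellular_component"]

# Pipeline: documents -> TFIDF vectors -> logistic regression
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(abstracts, labels)
print(model.predict(["membrane protein of the endoplasmic reticulum"]))
```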
The SVM algorithm100 seeks to maximize the separation between points corresponding to different classes in some n-dimensional space, and therefore determines a maximum-margin hyperplane. As the data are often not linearly separable in the original feature space, they are typically mapped to a higher-dimensional space in which the separation should be easier. This is achieved by means of kernel functions, such as the polynomial or the radial basis function (a Gaussian function). These kernels have different variables (known as hyperparameters) which need to be tuned in order to achieve better performance. In the case of the widely used radial basis function, these include the γ hyperparameter (which controls the width of the Gaussian) and the C hyperparameter (a penalty factor which controls overfitting vs underfitting). Due to its successes in other fields, the SVM is the most commonly used algorithm in early works which attempted to use machine learning techniques for protein function prediction. Examples of prior work using SVMs include References 2, 15, 32, 35, 38, 39, 41, 42, 46, 51, 60, 65-67, 69, 72, 81, 82, 95, 101-107. Generally, the best hyperparameters are identified through a grid search in the parameter space. In Reference 37, other techniques such as genetic algorithms and particle swarm optimization were also attempted; however, the grid search still yielded the best values. The k nearest neighbors (kNN) algorithm is a nonparametric method which classifies a given observation through a majority vote of the labels of the closest k points in a given feature space. No model training is required. However, the majority voting procedure suffers when the class distribution is skewed. Examples in the literature in which kNN was used include References 25, 54, 58, 76, 80. Ensemble methods combine several base models in order to produce a better predictive model. There are two categories of ensemble methods. In sequential methods, "boosting" is used to incrementally build an ensemble, by training each model on the same dataset but adjusting the weights of individual data points according to the error of the last prediction. Examples of such methods include AdaBoost108 and Gradient Boosting.109 The XGBoost algorithm110 is a scalable tree boosting system, which was used in Reference 68 to classify human proteins as aging-related or nonaging-related. On the other hand, parallel methods use "bagging" (also known as bootstrap aggregation) to generate multiple base models simultaneously. The random forest111 uses bagging as one of its two main sources of randomization, the other being the fact that it randomly samples features to be used as candidate features for selecting the best feature to split the data at each tree node. This technique was used for protein function prediction in References 43, 55, 71, 112. In Reference 113, a protein function prediction method called the transductive multilabel classifier was developed, based on a directed birelational graph that models the relationship between proteins and functions. This was extended in the same paper to transductive multilabel ensemble classification. Pitting several machine learning models against each other, and then determining the prediction output based on voting, can also be used to develop an ensemble method. In Reference 114, the majority vote was used together with the mean ensemble and top-k ensemble algorithms in predicting human protein subcellular localization. Decision trees are one of the simplest machine learning models.
Each leaf in the tree represents a decision or output of the model. A decision is reached after traversing a particular path along the tree's branches. Several implementations of decision trees exist. The C4.5 decision tree115 is generally used for classification and has been used in Reference 40. A novel decision tree classifier presented in Reference 62 improved on the C4.5 technique by using an uncertainty measure for best attribute selection. In Reference 116, the Clus-HMC heuristic117 was used to select the best attributes to construct the tree. Another novel implementation of a decision tree was the Recursive Maximum Contrast Tree developed in Reference 118. A neural network consists of a series of interconnected layers of units called neurons. Neurons are also known as perceptrons, which gives rise to the term "multilayer perceptron," a typical neural network architecture. The number of neurons in the input layer should match the input feature dimension, while the number of neurons in the output layer should match the number of outputs. In classification problems, it is desirable to represent the output from the network using one-hot encoding. In one-hot encoding, categorical variables are represented by a binary vector with a length equivalent to the cardinality of the set of values of the categorical variable. The vector is filled with zeros, except at the index of the categorical value, which is assigned a 1. The output of a given neuron is computed via an activation function, which in turn takes as input the weighted sum of the outputs from the previous layer of neurons. Typical activation functions include the sigmoid, tanh, and rectified linear unit (ReLU). Therefore, the goal of training is to learn appropriate values for the weights so as to obtain a correct output for a given input. In order to learn more complex input-output mappings, the neural network architecture typically has a number of intermediate layers called hidden layers. In Reference 119, the authors performed hierarchical multilabel classification using local multilayer perceptrons. This approach takes into account the fact that proteins may perform several functions, which may be further specialized into subfunctions. The large number of output labels (eg, several thousand) can hinder the performance of machine learning algorithms. Therefore, in Reference 93, an ensemble of 100 neural networks was trained to predict protein function, each with 100 outputs, rather than a single neural network. A hierarchical neural network was also trained in Reference 120, exploiting the inherent hierarchical nature of protein function. The authors trained both Adaline networks, composed of a single layer of adjustable weights, and multilayer perceptrons (two layers). The latter architecture achieved better performance. Multilabel hierarchical classification was performed using competitive neural networks in Reference 121. The difference with respect to standard multilayer perceptrons is that the neurons of the output layer compete to be activated, such that only one output neuron is declared the "winner" of the competition process. Another algorithm which seeks to mimic the function of the brain is the neural response.122 It simulates the neuronal behavior of the visual cortex, and was used for protein function prediction in Reference 61 by defining a distance metric that corresponds to the similarity of the amino acid subsequences. The latter was important for understanding how the brain can distinguish different sequences. Probabilistic neural networks use the Bayes optimal decision rule for classification, and take into account the probability density function for each class. The latter can be estimated using the Parzen nonparametric estimator. Probabilistic neural networks were used in Reference 46, and offered better predictive performance than kNN and SVM in identifying protein functional families from sequence.
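A minimal Keras sketch of a multilayer perceptron with one-hot encoded outputs follows; the feature dimension, class count, and data are illustrative placeholders.

```python
# Sketch of one-hot label encoding and a small multilayer perceptron in Keras.
import numpy as np
from tensorflow import keras

n_features, n_classes = 40, 5
X = np.random.rand(500, n_features)          # placeholder protein features
y = np.random.randint(n_classes, size=500)   # placeholder class indices
# One-hot: a binary vector of length n_classes with a single 1 per sample
Y = keras.utils.to_categorical(y, num_classes=n_classes)

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),            # hidden layer
    keras.layers.Dense(n_classes, activation="softmax"),  # one unit per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, Y, epochs=5, batch_size=32, verbose=0)
```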
Rather than being limited to predicting continuous or discrete valued outputs, deep learning123 is particularly concerned with learning data representations, that is, feature learning. This allows the model to automatically discover the required features, replacing the traditional feature engineering and selection process. Such models are therefore also known as end-to-end models. Deep learning is commonly associated with neural network architectures which have several hidden layers. With the advances in computing power afforded by Graphical Processing Units (GPUs), training of deep learning models for a variety of tasks is now within reach. This holds only for the cases where a very large amount of data is available, in order to properly estimate the very large number of parameters of deep neural networks.123 In Reference 45, three separate models (one per GO subontology) were trained using deep learning on amino acid sequence and protein-protein interaction data. Trigrams were built from the amino acid sequences and converted to dense embeddings, while the PPI network features were used to generate knowledge graph embeddings. The sequence features were then passed through a 1D convolutional layer, and max pooling was then performed. The output from max pooling was combined with the PPI network features in a fully connected layer with 1024 neurons, which was subsequently passed to hierarchically structured neural networks with sigmoid activation functions for classification. The full architecture is shown in a figure in Reference 45. Three deep architectures were evaluated in Reference 84 to predict human protein function. The first architecture, developed by the authors, was a multitask deep neural network (MTDNN), which consisted of shared hidden layers and task-specific hidden layers. Its performance was compared to a multilabel deep neural network (in which shared hidden layers are used all the way to the final output layer) and a single-task deep neural network. The MTDNN performed better than the other two architectures, as well as FFPred3 and BLAST. Deep network fusion was used for protein function prediction in Reference 32. A multimodal deep autoencoder was used to extract features, which were then passed on to an SVM. In Reference 47, deep learning was used to learn embeddings for protein sequences, which were restricted to a maximum length of 2000. Each amino acid was represented as a 23-dimensional vector. A convolutional layer together with average pooling was then trained on a GPU. A similar approach was used in Reference 107, in which the output from a stacked denoising autoencoder was passed to a binary-relevance SVM. The input dataset consisted of microarray expression data and phylogenetic profiles for yeast. The authors in Reference 64 focused on human protein subcellular localization, and also used a stacked autoencoder. They tried SVMs, random forests, and softmax regression in the last layer of the deep learning network to make predictions, and found that the best results were achieved with the latter.
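A hedged sketch of a sequence branch loosely mirroring such architectures (integer-encoded residues, a dense embedding, 1D convolution, max pooling, and sigmoid outputs for multilabel GO terms); all sizes and data are illustrative rather than those of Reference 45.

```python
# Sketch of a sequence branch for multilabel GO prediction: integer-encoded
# amino acids -> dense embedding -> 1D convolution -> max pooling -> dense
# layers with sigmoid outputs (one per GO term).
import numpy as np
from tensorflow import keras

max_len, vocab, n_terms = 1000, 21, 50   # 20 amino acids + a padding token
X = np.random.randint(vocab, size=(256, max_len))   # placeholder sequences
Y = np.random.randint(2, size=(256, n_terms))       # placeholder GO targets

model = keras.Sequential([
    keras.Input(shape=(max_len,)),
    keras.layers.Embedding(vocab, 32),                  # dense embeddings
    keras.layers.Conv1D(64, 8, activation="relu"),      # 1D convolution
    keras.layers.GlobalMaxPooling1D(),                  # max pooling
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(n_terms, activation="sigmoid"),  # one unit per GO term
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, Y, epochs=3, batch_size=32, verbose=0)
```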
Protein-protein interaction was the subject of machine learning based prediction in Reference 48, where the authors applied a stacked autoencoder to autocovariance features derived from the protein sequences in Pan's PPI dataset. A similar approach was used in Reference 49. Deep learning has also been applied to protein function prediction with text-based features: deep semantic text representation was used in Reference 77 for biomedical literature, while multifunctional enzyme function prediction was achieved in Reference 78 with hierarchical multilabel deep learning. As mentioned previously, immunohistochemistry images are a potential source of features in protein function prediction. Standard (so-called vanilla) neural network architectures run into problems when they need to be applied to images (in two dimensions or more when also considering, eg, color) or to sequential data. In the case of the former, the high dimensionality due to the high image resolution means that the neural network will have many parameters which need to be learned. This leads to slow training and poor performance. Convolutional Neural Networks (CNNs),124 on the other hand, take advantage of local spatial coherence in the images and perform convolution operations, which result in fewer parameters to be learned. Several recently developed CNN architectures, such as VGG16125 or AlexNet,126 achieve high performance. In addition, it is possible to use these pretrained models on unseen data. In Reference 127, the author used a CNN in conjunction with both an SVM and a kNN classifier to predict protein function. On the other hand, in Reference 70, CNNs were trained to identify families of efflux proteins in transporters using features extracted from PSSM profiles. Recurrent Neural Networks (RNNs)128 are suitable for processing sequential data, as their architecture allows an internal state to be maintained. Long short-term memory (LSTM)129 networks are an evolution of RNNs which have been applied successfully in a number of domains, from speech recognition130 to DNA sequences.131 In Reference 53, a deep RNN was used to predict protein function from sequence, while in Reference 132 the authors used a three-unit LSTM together with neural machine translation. In several cases, rather than just using a single machine learning model, results were achieved through a combination of algorithms. For instance, in Reference 63 the classification results from neural networks and SVMs were fused via a heuristic fusion rule. In Reference 133, an ensemble multi-instance, multilabel learning neural network was trained. Multi-instance,134 multilabel learning is useful when an observation is described by multiple instances and associated with multiple class labels.135 It is therefore particularly applicable to protein function prediction, as proteins are often inherently multidomain and multifunctional, and each domain may fulfill its own function independently or in a concerted manner with its neighbors. A two-layer architecture was developed in Reference 133. In the first layer, training examples for each class label were clustered by invoking k-medoids, and the medoids of the clustered groups were retained. Neural networks were then used to compute the basis functions between an example and the medoids.
Several other machine learning algorithms and models were used only sparingly in the literature. Rough set theory was developed in the early 1980s as a mathematical approach to intelligent data analysis and data mining.136 It distinguishes between objects based on the concept of indiscernibility, and deals with the approximation of sets using binary relations constructed from empirical data. Rough sets were used in Reference 91 to predict between seven pectin lyase-like subfamilies. In Reference 137, the protein was modeled as a document, while the protein function label was the topic. A supervised topic model (labeled latent Dirichlet allocation138) was used to make predictions based on protein sequences organized into a bag of words. A bag of words from the protein sequence was also used to generate features in Reference 59, with a model based on multilabel linear discriminant analysis then being trained. Finally, multilabel Gaussian kernel regression was used in Reference 139.

3 | MACHINE LEARNING MODEL IMPLEMENTATION, TUNING, AND EVALUATION

In the past decade, machine learning frameworks have evolved in response to increasing demand across various disciplines. The most commonly used frameworks are Scikit-Learn140 for the Python programming language (which has been ranked as the most commonly used programming language by the IEEE Spectrum since 2017141), several packages in R, and the Statistics and Machine Learning Toolbox of MATLAB.142 Due to the computationally intensive nature of deep learning, several libraries and frameworks have also been developed which allow models to be trained at faster speeds on GPUs and computing clusters, such as TensorFlow,143 Keras,144 Caffe,145 and PyTorch.146 With increasing amounts of training data available, from Gene Ontology to biomedical literature, as well as new, computationally intensive architectures which use deep learning, the use of GPUs is becoming more prevalent in training machine learning algorithms to predict protein function. Examples of works in the literature which made use of hardware acceleration include References 45, 47, 64, 107. The machine learning models described in the previous section often need to be tuned to achieve a more satisfactory performance. This involves determining appropriate hyperparameters (which are set prior to training by the data scientist, as opposed to the parameters, which are learnt during training) that also allow the model to generalize and perform well on unseen data. There are only a few general hyperparameters, such as the optimizer (eg, Adam,147 RMSprop,148 or stochastic gradient descent) and the learning rate, while the rest are usually specific to a particular model or algorithm. In neural networks in particular, these might be the activation function of the neurons, as well as the number of neurons in each layer and the number of hidden layers. Neural network performance can also be boosted by conducting the training over multiple epochs, that is, by increasing the number of times the training data flow through the network. Although there are suggested ranges of values and rules of thumb, there is no exact science for selecting the best hyperparameters prior to training. The most widely used method is to perform a search in the space of hyperparameters, either in a random fashion or using a systematic grid search. The hyperparameter search is facilitated by several modern machine learning frameworks and is often combined with cross-validation.
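For example, a grid search over the RBF-kernel SVM hyperparameters γ and C, combined with cross-validation, can be sketched in scikit-learn as follows; the data are synthetic and the grid values illustrative.

```python
# Sketch of a systematic grid search over the RBF-kernel SVM hyperparameters
# gamma and C, combined with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```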
In its simplest form, k-fold cross-validation splits the randomly shuffled training dataset into k groups; in each of k rounds, one group is held out as the test set while the remaining groups are used as the training set. Therefore, an averaged performance result can be obtained, avoiding the so-called "lucky split." Sometimes, the performance of machine learning classifiers can be boosted by mitigating class imbalance, which occurs when the number of samples in each class is skewed. In Reference 149, the authors compared the performance of three class-balance strategies for SVMs in relation to protein function prediction. These included under-sampling, in which the extra samples of the majority class(es) are discarded; the Synthetic Minority Over-sampling Technique (SMOTE), in which synthetic samples of the minority class are added to the dataset; and weighted SVM, which keeps the number of samples in each class but assigns appropriate weights during training to specifically improve the performance for the minority class. The latter two techniques achieved the best results, although weighted SVM was less computationally demanding. As protein function prediction is generally treated as a classification problem, the metrics used to evaluate the performance of the machine learning models typically include accuracy, precision, recall (sensitivity), specificity, and the F1-score. The F1-score is defined as the harmonic mean of the precision and the recall, and handles class imbalance better than accuracy (since accuracy can be trivially maximized by always predicting the majority class). In addition, for better visualization and performance understanding, a Receiver Operating Characteristic (ROC) curve can be obtained by plotting the true positive rate as a function of the false positive rate. The larger the Area Under the Curve (AUC), the better the performance, as this normally means that a higher true positive rate is achieved for the same false positive rate. A similar metric can be derived from the precision-recall (PR) curve, known as the Area Under the PR curve (AUPR). Two further metrics used as the gold standard in the CAFA challenges are Fmax and Smin.11 The former is defined as the maximum F1-score obtained by varying the classifier threshold (and therefore the working point along, eg, a precision-recall curve), while the latter is obtained by minimizing the uncertainty and misinformation. A list of commonly used metrics found in the literature for evaluating the performance of classifiers for protein function prediction is shown in Table 2. Typically, several metrics tend to be calculated in a given work, as most of them provide complementary information. Although certain algorithms and models have proved to learn input-output mappings more effectively than others, and in a variety of domains, it is appropriate to train different machine learning models and see which results in the best performance. In Reference 152, the performance of logistic regression, Naive Bayes, SVMs, a decision tree, and a neural network was compared to evaluate the suitability of using dissimilarity representations. The SVM algorithm was found to give the best results in terms of the F1-score and AUC metrics. In Reference 17, an extreme learning machine was compared to an SVM, while in Reference 57, Gaussian Naive Bayes was trained together with a decision tree, a random forest, logistic regression, kNN, and SVMs with both polynomial and RBF kernels. An SVM and the kNN algorithm were trained on sequence motifs for enzyme classification.52 Finally, in Reference 153, the performance of a simple neural network was compared to that of an SVM for the prediction of human-Bacillus anthracis protein-protein interactions.
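Such model comparisons can be sketched with cross-validated F1-scores, as below; the synthetic data and the particular models shown are illustrative, not a reproduction of any cited study.

```python
# Sketch of pitting several models against each other with cross-validated
# F1-scores, in the spirit of the comparisons cited above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "kNN": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```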
4 | APPLICATIONS OF MACHINE LEARNING FOR PROTEIN FUNCTION PREDICTION

In most of the literature, machine learning algorithms are trained to predict protein function using a particular classification scheme as the ground truth. The most common taxonomies are the Functional Catalogue (FunCat),154 the Enzyme Commission (EC) classification,155 and the Gene Ontology.5 The FunCat annotation scheme consists of 28 main categories that cover general features such as cellular transport, metabolism, and protein activity regulation. Each of the main branches has a hierarchical, tree-like structure. The EC is a hierarchical classification scheme for enzymes, based on the chemical reactions they catalyze. The top level consists of seven enzyme classes, such as oxidoreductases, hydrolases, and ligases. The Gene Ontology (GO) defines a representation of terms for gene product properties. The ontology covers three domains: cellular component (which refers to the parts of a cell or its extracellular environment); molecular function (the elemental activities of a gene product at the molecular level, such as binding or catalysis); and biological process (operations or sets of molecular events with a defined beginning and end, relevant to the functioning of integrated living units such as cells and tissues). Each of the three GO ontologies is a Directed Acyclic Graph (DAG), in which a node (GO term) can have multiple parents in the hierarchy, unlike the simpler tree-based hierarchies of the EC code and FunCat mentioned earlier. Earlier works focused on classification schemes such as EC and FunCat. In Reference 36, the authors trained an SVM and a random forest to predict top-level EC classes from seven features, such as amino acid sequence, molecular weight, and chain length. SVMs were also used in Reference 60 for the same class prediction task, this time using two sequences per protein, corresponding to the primary and secondary structures. In Reference 34, SVMs were trained on features such as physicochemical properties and sequence similarities to predict five different levels of enzymatic function. Very good results (an F1-score of 0.99) were achieved. Protein functions were predicted according to the FunCat taxonomy in Reference 59, using multilabel linear discriminant analysis, and in Reference 75, using multilabel semi-supervised learning on graphs based on input features from protein-protein interactions. The latter category of input features was also used in Reference 74 to train a logistic regressor and predict 17 FunCat classes. Protein function prediction based on GO terms is a more recent initiative. The various editions of the CAFA challenge have made available an increasing number of sequences to the community, with the task of predicting GO annotations. A summary of the latest CAFA3 and CAFA-π challenges is available in Reference 156. The GOLabeler ensemble method,50 which combines BLAST-kNN, logistic regression, and a naive GO term frequency computation to solve the problem of Learning To Rank (LTR), achieved the best performance when compared to other CAFA3 entries across the board (ie, for the molecular function, biological process, and cellular component ontologies). Machine learning techniques have been used to predict functions related to one, two, or all three domains.
According to CAFA, prediction accuracy achieved with machine learning techniques is lowest in the biological process domain.11 Around a dozen works attempt to predict protein function related to all three domains. The techniques used, already expanded upon earlier, range from deep neural networks32,45,47,84,132 to kNN,58 logistic regression,77 and SVMs.56,79,103 In Reference 80, a kNN classifier was used to predict protein function from text-based features derived from the biomedical literature in both the molecular function and biological process domains. Only molecular function was considered in the works of References 70, 89, 96, 149, 157. Transductive multilabel ensemble classification was used to determine protein functions related only to biological process in Reference 113. Most of the literature that focused on predicting protein function for a single domain addressed the cellular component category. The most used machine learning model has been the SVM, applied in References 15, 16, 44, 65, 66, 72, 81, 101, 150, 158. Deep learning is used in References 64, 139. In Reference 83, immunohistochemistry images from the Human Protein Atlas database were used. An ensemble strategy was used in Reference 114 for human protein subcellular localization, while the authors in Reference 71 used a random forest to distinguish Golgi-resident protein types from non-Golgi-resident ones.

In other instances in the literature, machine learning algorithms were trained to predict whether a given protein fits into one of a select number of classes. In Reference 151, three different models (SVM, random forest, and kNN) were trained on short linear motifs to predict whether a particular protein was a calmodulin-binding or a mitochondrial protein. An SVM was also used in Reference 104 to determine whether or not a particular protein was an apolipoprotein; apolipoproteins are crucial in cardiovascular systems and drug design. The authors in Reference 89 developed a tool (HBPred) to identify growth hormone-binding proteins. Dipeptide composition, which describes the correlation between two contiguous amino acid residues, was used as a feature on which an SVM was trained (a generic sketch of this encoding is given below, after the tables). The same type of machine learning model was also used in Reference 69 to classify signaling proteins based on molecular star graph descriptors. A plethora of techniques, including Gaussian naive Bayes, decision trees, random forests, and logistic regression, was used in Reference 57 to develop a binary classifier separating DNA-binding from non-DNA-binding proteins. In Reference 62, a decision tree classifier was trained to predict among the five molecular classes of HPRD, namely defensin, cell surface receptor, DNA repair protein, heat shock protein, and voltage-gated channel. In Reference 54, a kNN multilabel classifier was used to predict enzyme function at the level of chemical mechanism. An SVM was used to discriminate among RNA-binding, DNA-binding, and EF-hand proteins in Reference 41. Rough sets were used to distinguish seven pectin lyase-like subfamilies in Reference 91, based on features derived from amino acid composition. Despite the significant body of work in which machine learning algorithms are trained to predict protein function, relatively little effort has been devoted to the issue of class imbalance in function labels.

TABLE 2 List of commonly used metrics found in the literature for evaluating the performance of classifiers for protein function prediction

| Metric | Advantages | Disadvantages | Usage in literature |
|---|---|---|---|
| Accuracy | Answers the question: how many samples were correctly labeled out of all samples? | Provides misleading information in the event of class imbalance | 16,17,25,32,47,48,53,54,56,60,64-67,91,105,139,150,151 |
| Precision | Answers the question: how many samples labeled as COI actually belong to the COI? | Does not consider false negatives | 17,34,36,46-48,53-55,60,82,107,132,150 |
| Recall | Answers the question: of all the samples which actually belong to the COI, how many were correctly predicted? | Does not consider false positives | 15,17,34,36,39,46-48,53-56,60,74,82,107,132,139,150 |
| Specificity | Answers the question: of all the samples which do not belong to the COI, how many were correctly predicted? | Does not consider false negatives | 15,39,46,48,56,62,74,139,150 |
| F1-score | Better suited for cases of class imbalance | Not as intuitive as other metrics | 32,34,36,47,53,56,96,107,132 |
| AUROC | Score is independent of the threshold set for the classifier | Provides misleading information in the event of class imbalance | 36,50,56,69 |
| AUPR | Score is independent of the threshold set for the classifier and not affected by class imbalance | Does not consider true negatives | 50,58,77 |
| Fmax | Considers predictions across the full spectrum from high to low sensitivity | Penalizes specific predictions | 45,50,77,82,84 |
| Smin | Takes into account the structure of the ontology and the dependencies between terms induced by a hierarchical ontology | Assumes that a Bayesian network structured according to the underlying ontology will perfectly model the prior probability distribution of a target variable | 50,77 |

Abbreviations: COI, class of interest; FN, false negative; FP, false positive; TN, true negative; TP, true positive.

TABLE 3 Performance comparison of various machine learning models and algorithms on the EC taxonomy

| Taxonomy | Protein | ML method | Hyperparameter optimization | Result (metric) | Usage in literature |
|---|---|---|---|---|---|
| EC | Enzyme | Random forest | N/A | 0.486 (F1-score) | 36 |
| EC | Enzyme | SVM | N/A | 0.480 (F1-score) | 34 |
| EC | Enzyme | C4.5 classifier | N/A | 0.7213 (F1-score) | 40 |
| EC | Enzyme | SVM | GA | 0.70 (F1-score) | 37 |
| EC | Enzyme | SVM | PSO | 0.69 (F1-score) | 37 |
| EC | Enzyme | Deep neural network | N/A | 0.965 (F1-score) | 78 |

Abbreviations: GA, genetic algorithm; PSO, particle swarm optimization.
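Fmax, the headline metric in Tables 5 through 7 (and in CAFA), is the protein-centric F-measure maximized over the classifier's decision threshold. The sketch below is a simplified rendering of that computation, following the convention that precision is averaged only over proteins with at least one prediction above the threshold; it is illustrative, not the official CAFA evaluation code.

```python
import numpy as np

def fmax(y_true, y_score, thresholds=np.linspace(0.01, 1.0, 100)):
    """Simplified Fmax for multilabel GO predictions.

    y_true:  (n_proteins, n_terms) binary matrix of true annotations.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    Precision at each threshold is averaged over proteins that predict
    at least one term; recall is averaged over all proteins.
    """
    best = 0.0
    for t in thresholds:
        pred = y_score >= t
        has_pred = pred.sum(axis=1) > 0
        if not has_pred.any():
            continue
        tp = (pred & (y_true == 1)).sum(axis=1)
        prec = (tp[has_pred] / pred[has_pred].sum(axis=1)).mean()
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best

# Toy example: two proteins, three GO terms.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3]])
print(fmax(y_true, y_score))  # 1.0 at some threshold for this toy case
```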
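As an aside on the dipeptide composition feature mentioned above, the following sketch computes it for a toy sequence: every ordered pair of contiguous residues is counted and normalized, yielding a fixed 400-dimensional vector regardless of sequence length. This is a generic rendering of the feature, not the HBPred implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def dipeptide_composition(sequence: str) -> dict:
    """Fraction of each of the 400 possible ordered amino acid pairs.

    Counts every window of two contiguous residues and normalizes by
    the number of windows, giving a fixed-length feature vector.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    windows = len(sequence) - 1
    for i in range(windows):
        counts[sequence[i:i + 2]] += 1
    return {dp: c / windows for dp, c in counts.items()}

features = dipeptide_composition("MKTAYIAKQR")  # toy sequence
print(features["KT"], features["AY"])  # each pair occurs in 1 of 9 windows
```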
This imbalance stems from the fact that, for example, the GO database rarely stores which proteins do not possess a particular function. In Reference 159, the authors developed two novel negative selection algorithms (Selection of Negatives through Observed Bias, and Negative Examples from Topic Likelihood) to determine whether a protein does or does not perform a particular function.

A summary of the comparison in performance between various machine learning models and algorithms is provided in Tables 3 and 4 for the EC and FunCat taxonomies, respectively, and in Tables 5-7 for the molecular function, biological process, and cellular component GO taxonomies, respectively. As can be seen, most studies applying machine learning to the EC and FunCat taxonomies did not disclose any hyperparameter optimization strategy beyond a grid search. In particular, in Reference 37 genetic algorithms and particle swarm optimization were used, but these did not yield an increase in performance with respect to Reference 40, which used a decision tree.

TABLE 4 Performance comparison of various machine learning models and algorithms on the FunCat taxonomy

| Taxonomy | Protein | ML method | Hyperparameter optimization | Result (metric) | Usage in literature |
|---|---|---|---|---|---|
| FunCat | Yeast | MLDA | N/A | 0.412 (F1-score) | 59 |
| FunCat | Yeast | MLDA + graph | N/A | 0.437 (F1-score) | 59 |
| FunCat | Yeast | NMLDA + graph | N/A | 0.440 (F1-score) | 59 |
| FunCat | Yeast | MCSL-d (PPI-weight, infor) | Grid-search | 0.4857 (F1-score) | 75 |
| FunCat | Yeast | MCSL-b (PPI-weight, infor) | Grid-search | 0.4865 (F1-score) | 75 |

Abbreviations: MCSL, multilabel correlated semi-supervised learning; MLDA, multilabel linear discriminant analysis; NMLDA, L1-normalized MLDA.

TABLE 5 Performance comparison of various machine learning models and algorithms on the GO taxonomy (molecular function)

| Protein | ML method | Hyperparameter optimization | Result (metric) | Usage in literature |
|---|---|---|---|---|
| Yeast | Autoencoder + SVM | Manual adjustment of activation functions, number and sizes of hidden layers, batch sizes, and learning rates for the autoencoder; nested 5-fold CV via grid-search over γ and C for the SVM RBF kernel | 0.27 (F1-score) | 32 |
| Human | Autoencoder + SVM | Same as above | 0.18 (F1-score) | 32 |
| Human | Deep neural network | Manual tuning of minibatch size, number of convolution filters, filter size, number of neurons in the fully connected layer, and learning rate | 0.51 (Fmax) | 45 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.62 (Fmax) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 5.171 (Smin) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.567 (Fmax) | 50 |
| Difficult proteins | LR and BLAST-kNN | N/A | 5.087 (Smin) | 50 |
| Human | LR and BLAST-kNN | N/A | 0.625 (Fmax) | 156 |
| Human | MTDNN | Grid-search using HYPEROPT160 over the number of shared layers, hidden units per shared layer, number of task-specific layers, hidden units per specific layer, dropout rate, learning rate, and L1/L2 regularization | 0.311 (Fmax) | 84 |
| Human | MLDNN | Grid-search using HYPEROPT160 over the number of hidden layers, units per hidden layer, batch size, learning rate, dropout rate, and L1/L2 regularization | 0.343 (Fmax) | 84 |
| Human | STDNN | Same as MLDNN | 0.338 (Fmax) | 84 |

Note: Difficult proteins have a global sequence identity of less than 60%.
Abbreviations: CV, cross-validation; LR, logistic regression.

TABLE 6 Performance comparison of various machine learning models and algorithms on the GO taxonomy (biological process)

| Protein | ML method | Hyperparameter optimization | Result (metric) | Usage in literature |
|---|---|---|---|---|
| Yeast | Autoencoder + SVM | Manual adjustment of activation functions, number and sizes of hidden layers, batch sizes, and learning rates for the autoencoder; nested 5-fold CV via grid-search over γ and C for the SVM RBF kernel | 0.19 (Fmax) | 32 |
| Human | Autoencoder + SVM | Same as above | 0.125 (Fmax) | 32 |
| Human | Deep neural network | Manual tuning of minibatch size, number of convolution filters, filter size, number of neurons in the fully connected layer, and learning rate | 0.42 (Fmax) | 45 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.46 (Fmax) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 16.82 (Smin) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.382 (Fmax) | 50 |
| Difficult proteins | LR and BLAST-kNN | N/A | 14.538 (Smin) | 50 |
| Human | MTDNN | Same as in Table 5 | 0.298 (Fmax) | 84 |
| Human | MLDNN | Same as in Table 5 | 0.287 (Fmax) | 84 |
| Human | STDNN | Same as MLDNN | 0.288 (Fmax) | 84 |

Note: Difficult proteins have a global sequence identity of less than 60%.
Abbreviations: CV, cross-validation; LR, logistic regression.

TABLE 7 Performance comparison of various machine learning models and algorithms on the GO taxonomy (cellular component)

| Protein | ML method | Hyperparameter optimization | Result (metric) | Usage in literature |
|---|---|---|---|---|
| Yeast | Autoencoder + SVM | Manual adjustment of activation functions, number and sizes of hidden layers, batch sizes, and learning rates for the autoencoder; nested 5-fold CV via grid-search over γ and C for the SVM RBF kernel | 0.155 (Fmax) | 32 |
| Human | Autoencoder + SVM | Same as above | 0.125 (Fmax) | 32 |
| Human | Deep neural network | Manual tuning of minibatch size, number of convolution filters, filter size, number of neurons in the fully connected layer, and learning rate | 0.60 (Fmax) | 45 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.69 (Fmax) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 4.45 (Smin) | 77 |
| Difficult proteins | LR and BLAST-kNN | N/A | 0.706 (Fmax) | 50 |
| Difficult proteins | LR and BLAST-kNN | N/A | 5.344 (Smin) | 50 |
| Human | LR and BLAST-kNN | N/A | 0.6 (Fmax) | 156 |
| Human | MTDNN | Same as in Table 5 | 0.484 (Fmax) | 84 |
| Human | MLDNN | Same as in Table 5 | 0.449 (Fmax) | 84 |
| Human | STDNN | Same as MLDNN | 0.425 (Fmax) | 84 |

Note: Difficult proteins have a global sequence identity of less than 60%.
Abbreviations: CV, cross-validation; LR, logistic regression.
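Several of the tabulated studies tuned the RBF-kernel SVM via grid-search over γ and C inside a nested cross-validation scheme. A minimal scikit-learn sketch of that procedure is shown below; the grid values and synthetic dataset are illustrative placeholders, not those of any cited study.

```python
# Minimal sketch of the grid-search hyperparameter optimization reported
# for RBF-kernel SVMs in Tables 5-7: an outer CV loop estimates
# generalization while an inner grid-search picks gamma and C.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a protein feature matrix and binary labels.
X, y = make_classification(n_samples=200, n_features=40, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# Nested 5-fold CV: the inner search never sees the outer test folds,
# so the reported score reflects performance of the tuned model.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```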
Multilabel correlated semi-supervised learning75 with grid-search gave an improvement in F1-score over multilabel linear discriminant analysis59 for the FunCat taxonomy. The best performance for molecular function GO terms was given by GOLabeler, which did not make use of deep learning or hyperparameter optimization. A similar model developed by the same authors also gave the best performance for biological process on difficult proteins (ie, proteins with a global sequence identity of less than 60%).77 For the cellular component ontology, however, it performed only as well as a deep neural network approach (which required hyperparameter optimization) for human proteins.45

5 | CONCLUSIONS AND FUTURE PERSPECTIVES

This paper has reviewed the evolution in features and machine learning techniques used to train data-driven models for protein function prediction. Although there has been a rise in the use of deep learning techniques to extract meaningful features and develop high-performing predictors, methods using classical machine learning techniques such as logistic regression were still able to outperform deep learning approaches. In addition, methods that do not use machine learning at all still feature prominently among the top 10 performers of CAFA3, ahead of deep learning approaches. The fact that deep learning requires very large amounts of data remains a limitation, and probably reduces its success in at least some studies concerning protein function prediction. Nevertheless, the bioinformatics community has been quite successful in its efforts to bring machine learning and proteins together, through initiatives such as the CAFA challenge and Kaggle competitions. The community can keep this momentum going by facilitating the proliferation of databases and frameworks that are better suited to machine learning. Researchers are now also resorting to a much wider variety of input features, particularly those derived from biomedical text. Reliable data-driven models are key to narrowing the gap between the number of sequences with known and unknown function, which will ultimately help elucidate the effect of protein mutations on disease and aid the engineering of new proteins.

ORCID
Rosalin Bonetta https://orcid.org/0000-0003-4696-7770
Gianluca Valentino https://orcid.org/0000-0003-3864-7785

REFERENCES
1. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. Protein function in the post-genomic era. Nature. 2000;405:823-826.
2. Al-Shahib A, Breitling R, Gilbert DR. Predicting protein function by machine learning on amino acid sequences - a critical evaluation. BMC Genomics. 2007;8:78.
3. Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic prediction of protein function. Cell Mol Life Sci. 2003;60:2637-2650.
4. Mills CL, Beuning PJ, Ondrechen MJ. Biochemical functional predictions for protein structures of unknown or uncertain function. Comput Struct Biotechnol J. 2015;13:182-191. https://www.ncbi.nlm.nih.gov/pubmed/25848497.
5. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25-29.
6. Friedberg I. Automated protein function prediction - the genomic challenge. Brief Bioinform. 2006;7:225-242.
7. Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol. 2007;8:995-1005.
8. Gardy JL, Brinkman FS. Methods for predicting bacterial protein subcellular localization. Nat Rev Microbiol. 2006;4:741-751.
9. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017;45:D158-D169.
10. Punta M, Coggill PC, Eberhardt RY, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40:D290-D301.
11. Jiang Y, Oron TR, Clark WT, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016;17:184.
12. Bernardes JS, Pedreira CE. A review of protein function prediction under machine learning perspective. Recent Pat Biotechnol. 2013;7:122-141.
13. Sharma M, Garg P. Computational approaches for enzyme functional class prediction: a review. Curr Proteomics. 2014;11(1):17-22. https://www.ingentaconnect.com/content/ben/cp/2014/00000011/00000001/art00003.
14. Wang Z, Zou Q, Jiang Y, Ju Y, Zeng X. Review of protein subcellular localization prediction. Curr Bioinformatics. 2014;9(3):331-342. https://www.ingentaconnect.com/content/ben/cbio/2014/00000009/00000003/art00015.
15. Hoglund A, Donnes P, Blum T, Adolph HW, Kohlbacher O. MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics. 2006;22:1158-1165.
16. Shatkay H, Hoglund A, Brady S, Blum T, Donnes P, Kohlbacher O. SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics. 2007;23:1410-1417.
17. You ZH, Lei YK, Zhu L, Xia J, Wang B. Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis. BMC Bioinformatics. 2013;14:S10.
18. Schwikowski B, Uetz P, Fields S. A network of protein-protein interactions in yeast. Nat Biotechnol. 2000;18:1257-1261.
19. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498-2504.
20. Govindan G, Nair AS. Composition, Transition and Distribution (CTD): a dynamic feature for predictions based on hierarchical structure of cellular sorting. Proceedings of the 2011 Annual IEEE India Conference, 2011. p. 1-6.
21. Liu ZX, Liu SL, Yang HQ, Bao LH. Using protein granularity to extract the protein sequence features. J Theor Biol. 2013;331:48-53.
22. Chou KC. Prediction of protein signal sequences and their cleavage sites. Proteins. 2001;42:136-139.
23. Gribskov M, McLachlan AD, Eisenberg D. Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA. 1987;84:4355-4358.
24. Jeong JC, Lin X, Chen XW. On position-specific scoring matrix for protein function prediction. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:308-315.
25. Wang W, Zhang X, Meng J, Luan Y. Protein function prediction based on physiochemical properties and protein granularity. Proceedings of the IEEE International Conference on Granular Computing, Beijing, China, 2013. p. 342-346.
26. Verspoor KM. Roles for text mining in protein function prediction. Methods Mol Biol. 2014;1159:95-108.
27. Zeng Z, Shi H, Wu Y, Hong Z. Survey of natural language processing techniques in bioinformatics. Comput Math Methods Med. 2015;2015:1-10.
28. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2013. p. 3111-3119.
29. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space; 2013.
30. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One. 2015;10:e0141287.
31. Kaggle. Human Protein Atlas Image Classification; 2018. https://www.kaggle.com/c/human-protein-atlas-image-classification.
32. Gligorijevic V, Barot M, Bonneau R. deepNF: deep network fusion for protein function prediction. Bioinformatics. 2018;34:3873-3881.
33. Wang J, Zhang L, Jia L, Ren Y, Yu G. Protein-protein interactions prediction using a novel local conjoint triad descriptor of amino acid sequences. Int J Mol Sci. 2017;18:E2373.
34. Dalkiran A, Rifaioglu A, Martin M, Cetin-Atalay R, Atalay V, Dogan T. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics. 2018;19:334.
35. Rahman S, Bakar A, Hussein Z. Data mining framework for protein function prediction. Proceedings of the IEEE International Symposium on Information Technology, Kuala Lumpur, Malaysia, 2008.
36. Srivastava A, Mahmood R, Srivastava R. A comparative analysis of SVM and random forest methods for protein function prediction. Proceedings of the IEEE International Conference on Current Trends in Computer, Electrical, Electronics and Communication, Mysore, India, 2018. p. 1008-1010.
37. Silva M, Leijoto L, Nobre C. Algorithms analysis in adjusting the SVM parameters: an approach in the prediction of protein function. J Appl Artif Intell. 2017;31:316-331.
38. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003;31:3692-3697.
39. Cai CZ, Wang WL, Sun LZ, Chen YZ. Protein function classification via support vector machine approach. Math Biosci. 2003;185:111-122.
40. Lee B, Ryu K. Feature extraction from protein sequences and classification of enzyme function. Proceedings of the IEEE International Conference on Biomedical Engineering and Informatics, Sanya, China, 2008. p. 138-142.
41. Lee B, Lee H, Kim D, Ryu K. Feature extraction in spatially-conserved regions and protein functional classification. Proceedings of Frontiers in the Convergence of Bioscience and Information Technologies, Jeju City, Korea, 2007. p. 165-170.
42. Rahman S, Bakar A, Hussein Z. Experimental study of different FSAs in classifying protein function. Proceedings of the IEEE International Conference of Soft Computing and Pattern Recognition, Malacca, Malaysia, 2009. p. 516-521.
43. Li F, Li C, Wang M, et al. GlycoMine: a machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics. 2015;31:1411-1419.
44. Acquaah-Mensah GK, Leach SM, Guda C. Predicting the subcellular localization of human proteins using machine learning and exploratory data analysis. Genomics Proteomics Bioinformatics. 2006;4:120-133.
45. Kulmanov M, Khan MA, Hoehndorf R, Wren J. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics. 2018;34:660-668.
46. Li Y, et al. SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PLoS One. 2016;11:e0155290.
47. Nauman M, Rehman H, Politano G, Benso A. Beyond homology transfer: deep learning for automated annotation of proteins. J Grid Comput. 2018;17:225-237.
48. Sun T, Zhou B, Lai L, Pei J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics. 2017;18:277.
49. Wang YB, You ZH, Li X, et al. Predicting protein-protein interactions from protein sequences by a stacked sparse autoencoder deep neural network. Mol Biosyst. 2017;13:1336-1344.
50. You R, Zhang Z, Xiong Y, Sun F, Mamitsuka H, Zhu S. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics. 2018;34:2465-2473.
51. You ZH, Zhu L, Zheng CH, Yu HJ, Deng SP, Ji Z. Prediction of protein-protein interactions from amino acid sequences using a novel multi-scale continuous and discontinuous feature set. BMC Bioinformatics. 2014;15:S9.
52. Ben-Hur A, Brutlag D. Sequence motifs: highly predictive features of protein function. In: Guyon I, Nikravesh M, Gunn S, Zadeh L, eds. Feature Extraction. Berlin, Heidelberg: Springer; 2006:625-645.
53. Liu X. Deep recurrent neural network for protein function prediction from sequence; 2017.
54. Ferrari LD, Mitchell J. From sequence to enzyme mechanism using multi-label machine learning. BMC Bioinformatics. 2014;15:150.
55. Kumar C, Li G, Choudhary A. Enzyme function classification using protein sequence features and random forest. Proceedings of the IEEE International Conference on Bioinformatics and Biomedical Engineering, Beijing, China, 2009.
56. Lee B, Shin M, Young J, Hae O, Ryu K. Identification of protein functions using a machine-learning approach based on sequence-derived properties. Proteome Sci. 2009;7:27.
57. Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One. 2014;9:e86703.
58. Makrodimitris S, van Ham R, Reinders M. Improving protein function prediction using protein sequence and GO-term similarities. Bioinformatics. 2018;35:1116-1124.
59. Wang H, Yan L, Huang H, Ding C. From protein sequence to protein function via multi-label linear discriminant analysis. IEEE/ACM Trans Comput Biol Bioinform. 2017;14:503-513.
60. Resende W, Nascimento R, Xavier C, Lopes I, Nobre C. The use of support vector machine and genetic algorithms to predict protein function. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Seoul, South Korea, 2012. p. 1773-1778.
61. Yalamanchili HK, Wang J, Xiao Q. NRProF: neural response based protein function prediction algorithm. Proceedings of the IEEE International Conference on Systems Biology, Zhuhai, China, 2011. p. 33-40.
62. Singh M, Singh P, Singh H. Decision tree classifier for human protein function prediction. Proceedings of the IEEE International Conference on Advanced Computing and Communications, Surathkal, India, 2006. p. 564-568.
63. Amidi S, Amidi A, Vlachakis D, Paragios N, Zacharaki EI. Automatic single- and multi-label enzymatic function prediction by machine learning. PeerJ. 2017;5:e3095.
64. Wei L, Ding Y, Su R, Tang J, Zou Q. Prediction of human protein subcellular localization using deep learning. J Parallel Distr Comput. 2018;117:212-217.
65. Yu CS, Lin CJ, Hwang JK. Predicting subcellular localization of proteins for gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci. 2004;13:1402-1406.
66. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics. 2003;19:1656-1663.
67. Zhou X, Chen C, Li Z, Zou X. Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol. 2007;248:546-551.
68. Kerepesi C, Daroczy B, Sturm A, Vellai T, Benczur A. Prediction and characterization of human ageing-related proteins by using machine learning. Sci Rep. 2018;8:4094.
69. Fernandez-Lozano C, Cuinas RF, Seoane JA, Fernandez-Blanco E, Dorado J, Munteanu CR. Classification of signaling proteins based on molecular star graph descriptors using machine learning models. J Theor Biol. 2015;384:50-58.
70. Taju SW, Nguyen TT, Le NQ, Kusuma R, Ou YY. DeepEfflux: a 2D convolutional neural network model for identifying families of efflux proteins in transporters. Bioinformatics. 2018;34:3111-3117.
71. Yang R, Zhang C, Gao R, Zhang L. A novel feature extraction method with feature selection to identify Golgi-resident protein types from imbalanced data. Int J Mol Sci. 2016;17:218.
72. Lin H, Ding H, Guo FB, Huang J. Prediction of subcellular location of mycobacterial protein using feature selection techniques. Mol Divers. 2010;14:667-671.
73. Lee H, Tu Z, Deng M, Sun F, Chen T. Diffusion kernel-based logistic regression models for protein function prediction. OMICS. 2006;10:40-55.
74. Ni Q, Wang Z, Han Q, Li G, Wang X, Wang G. Using logistic regression method to predict protein function from protein-protein interaction data. Proceedings of the IEEE International Conference on Bioinformatics and Biomedical Engineering, Beijing, China, 2009.
75. Jiang J, McQuay L. Predicting protein function by multi-label correlated semi-supervised learning. IEEE/ACM Trans Comput Biol Bioinform. 2012;9:1059-1069.
76. Hu L, Huang T, Shi X, Lu WC, Cai YD, Chou KC. Predicting functions of proteins in mouse based on weighted protein-protein interaction network and protein hybrid properties. PLoS One. 2011;6:e14556.
77. You R, Huang X, Zhu S. DeepText2GO: improving large-scale protein function prediction with deep semantic text representation. Methods. 2018;145:82-90.
78. Zou Z, Tian S, Gao X, Li Y. mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front Genet. 2019;9:714.
79. Rice SB, Nenadic G, Stapley BJ. Mining protein function from text using term-based support vector machines. BMC Bioinformatics. 2005;6:S22.
80. Wong A, Shatkay H. Protein function prediction using text-based features extracted from the biomedical literature: the CAFA challenge. BMC Bioinformatics. 2013;14:S14.
81. Zheng W, Blake C. Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles. J Biomed Inform. 2015;57:134-144.
82. Funk CS, Kahanda I, Ben-Hur A, Verspoor KM. Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct. J Biomed Semant. 2015;6:9.
83. Shao W, Liu M, Zhang D. Human cell structure-driven model construction for predicting protein subcellular location from biological images. Bioinformatics. 2016;32:114-121.
84. Fa R, Cozzetto D, Wan C, Jones DT. Predicting human protein function with multi-task deep neural networks. PLoS One. 2018;13:e0198216.
85. Molina L, Belanche L, Nebot A. Feature selection algorithms: a survey and experimental evaluation. Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, 2002. p. 306-313.
86. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507-2517.
87. Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: a survey from the search perspective. Methods. 2016;111:21-31. http://www.sciencedirect.com/science/article/pii/S1046202316302742.
88. Frank E, Hall MA, Witten IH. The WEKA Workbench. Online appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann; 2016.
89. Tang H, Zhao YW, Zou P, et al. HBPred: a tool to identify growth hormone-binding proteins. Int J Biol Sci. 2018;14:957-964.
90. Al-Shahib A, Breitling R, Gilbert DR. FrankSum: new feature selection method for protein function prediction. Int J Neural Syst. 2005;15:259-275.
91. Rahman S, Bakar A, Hussein Z. Feature selection and classification of protein subfamilies using rough sets. Proceedings of the IEEE International Conference on Electrical Engineering and Informatics, Selangor, Malaysia, 2009. p. 32-35.
92. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Proceedings of the IEEE Conference on Computational Systems Bioinformatics, Stanford, USA, 2003.
93. Clark WT, Radivojac P. Analysis of protein function and its prediction from amino acid sequence. Proteins. 2011;79:2086-2096.
94. Moreira IS, Koukos PI, Melo R, et al. SpotOn: high accuracy identification of protein-protein interface hot-spots. Sci Rep. 2017;7:8007.
95. Santos BD, Nobre C, Zarate L. Multi-objective genetic algorithm for feature selection in a protein function prediction context. Proceedings of the IEEE Congress on Evolutionary Computation, Rio de Janeiro, Brazil, 2018.
96. Fodeh S, Tiwari A, Yu H. Exploiting PubMed for protein molecular function prediction via NMF based multi-label classification. Proceedings of the IEEE International Conference on Data Mining Workshops, New Orleans, USA, 2017. p. 446-451.
97. Maheshwari S, Brylinski M. Prediction of protein-protein interaction sites from weakly homologous template structures using meta-threading and machine learning. J Mol Recognit. 2015;28:35-48.
98. Fabris F, Freitas A. An efficient algorithm for hierarchical classification of protein and gene functions. Proceedings of the IEEE International Workshop on Database and Expert Systems Applications, Munich, Germany, 2014. p. 64-68.
99. Merschmann L, Freitas A. An Extended Local Hierarchical Classifier for Prediction of Protein and Gene Functions. Berlin: Springer; 2013.
100. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, USA, 1992. p. 144-152.
101. Cai YD, Liu XJ, Xu X, Zhou GP. Support vector machines for predicting protein structural class. BMC Bioinformatics. 2001;2:3.
102. Lanckriet GR, Deng M, Cristianini N, Jordan MI, Noble WS. Kernel-based data fusion and its application to protein function prediction in yeast. Pacific Symposium on Biocomputing, Hawaii, USA, 2004. p. 300-311.
103. Cozzetto D, Minneci F, Currant H, Jones DT. FFPred 3: feature-based function prediction for all Gene Ontology domains. Sci Rep. 2016;6:31865.
104. Tang H, Zou P, Zhang C, Chen R, Chen W, Lin H. Identification of apolipoprotein using feature selection technique. Sci Rep. 2016;6:30441.
105. Zhang SB, Tang QR. Predicting protein subcellular localization based on information content of gene ontology terms. Comput Biol Chem. 2016;65:1-7.
106. Badal VD, Kundrotas PJ, Vakser IA. Natural language processing in text mining for structural modeling of protein complexes. BMC Bioinformatics. 2018;19:84.
107. Miranda L, Hu J. A deep learning approach based on stacked denoising autoencoders for protein function prediction. Proceedings of the IEEE 42nd Annual Computer Software and Applications Conference, Tokyo, Japan, 2018. p. 480-485.
108. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119-139.
109. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189-1232.
110. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining, San Francisco, USA, 2016. p. 785-794.
111. Breiman L. Random forests. Mach Learn. 2001;45:5-32.
112. Peled S, Leiderman O, Charar R, Efroni G, Shav-Tal Y, Ofran Y. De-novo protein function prediction using DNA binding and RNA binding proteins as a test case. Nat Commun. 2016;7:13424.
113. Yu G, Rangwala H, Domeniconi C, Zhang G, Yu Z. Protein function prediction using multilabel ensemble classification. IEEE/ACM Trans Comput Biol Bioinform. 2013;10:1045-1067.
114. Guo X, Liu F, Ju Y, Wang Z, Wang C. Human protein subcellular localization with integrated source and multi-label ensemble classifier. Sci Rep. 2016;6:28087.
115. Quinlan J. C4.5: Programs for Machine Learning. Boston: Morgan Kaufmann Publishers; 1993.
116. Cerri R, Basgalupp M, Mantovani R, de Carvalho A. Multi-label feature selection techniques for hierarchical multi-label protein function prediction. Proceedings of the IEEE International Joint Conference on Neural Networks, Rio de Janeiro, Brazil, 2018.
117. Vens C, Struyf J, Schietgat L, Dzeroski S, Blockeel H. Decision trees for hierarchical multi-label classification. Mach Learn. 2008;73:185-214.
118. Yang J, Yang M. Assessing protein function using a combination of supervised and unsupervised learning. Proceedings of the IEEE Symposium on Bioinformatics and Bioengineering, Arlington, USA, 2006. p. 35-44.
119. Cerri R, Barros RC, de Carvalho A, Jin Y. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformatics. 2016;17:373.
120. Nievola J, Paraiso E, Freitas A. A hierarchical neural network for predicting protein functions. Proceedings of the IEEE International Conference on Bioinformatics and Bioengineering, Belgrade, Serbia, 2015.
121. Borges H, Nievola J. Multi-label hierarchical classification using a competitive neural network for protein function prediction. Proceedings of the International Joint Conference on Neural Networks, Brisbane, Australia, 2012. p. 172-177.
122. Smale S, Rosasco L, Bouvrie J, Caponnetto A, Poggio T. Mathematics of the neural response. Found Comput Math. 2010;10:67-91.
123. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436-444.
124. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86:2278-2324.
125. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2015.
126. Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. Proceedings of the Neural Information Processing Systems Conference, Lake Tahoe, USA, 2012. p. 1106-1114.
127. Zacharaki E. Prediction of protein function using a deep convolutional neural network ensemble. PeerJ Comput Sci. 2017;3:e124.
128. Pearlmutter B. Learning state space trajectories in recurrent neural networks. Neural Comput. 1989;1:263-269.
129. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735-1780.
130. Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013. p. 6645-6649.
131. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44:e107.
132. Cao R, Freitas C, Chan L, Sun M, Jiang H, Chen Z. ProLanGO: protein function prediction using neural machine translation based on a recurrent neural network. Molecules. 2017;22:E1732.
133. Wu JS, Huang SJ, Zhou ZH. Genome-wide protein function prediction through multi-instance multi-label learning. IEEE/ACM Trans Comput Biol Bioinform. 2014;11:891-902.
134. Dietterich TG, Lathrop R, Lozano-Perez T. Solving the multiple instance learning problem with axis-parallel rectangles. Artif Intell. 1997;89:31-71.
135. Zhou Z, Zhang M, Huang S, Li Y. Multi-instance multi-label learning. Artif Intell. 2012;176:2291-2320.
136. Pawlak Z. Rough sets. Int J Comput Inf Sci. 1982;11:341-356.
137. Liu L, Tang L, He S, Yao S, Zhou W. Predicting protein function via multi-label supervised topic model on gene ontology. Biotechnol Biotechnol Equip. 2017;31:630-638.
138. Ramage D, Hall D, Nallapati R, Manning C. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Singapore, 2009. p. 248-256.
139. Cheng X, Lin WZ, Xiao X, Chou KC. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics. 2019;35:398-406.
140. Pedregosa F, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825-2830.
141. IEEE Spectrum. The Top Programming Languages in 2018; 2018. https://spectrum.ieee.org/static/interactive-the-top-programming-languages-2018.
142. The MathWorks Inc. MATLAB and Statistics Toolbox Release 2018b; 2018.
143. Abadi M, et al. TensorFlow: a system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, USA, 2016. p. 265-283.
144. Chollet F, et al. Keras; 2015. https://keras.io.
145. Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding. Proceedings of the ACM International Conference on Multimedia, Orlando, USA, 2014. p. 675-678.
146. Paszke A, et al. Automatic differentiation in PyTorch. Proceedings of the Neural Information Processing Systems Conference, Long Beach, USA, 2017.
147. Kingma D, Ba J. Adam: a method for stochastic optimization. Proceedings of the International Conference on Learning Representations, San Diego, USA, 2015.
148. Tieleman T, Hinton G. Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude; 2012.
149. Mercado-Diaz L, Navarro-Garcia J, Jaramillo-Garzon J. A comparison of class-balance strategies for SVM in the problem of protein function prediction. Proceedings of the 20th Symposium on Signal Processing, Images and Computer Vision, Bogota, Colombia, 2015.
150. Lu Z, Szafron D, Greiner R, et al. Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics. 2004;20:547-556.
151. Li Y, Maleki N, Carruthers N, Rueda L, Stemmer P, Ngom A. Prediction of calmodulin-binding proteins using short linear motifs. Proceedings of the International Conference on Bioinformatics and Biomedical Engineering, Granada, Spain, 2017. p. 107-117.
152. Santis ED, Martino A, Rizzi A, Mascioli F. Dissimilarity space representation and automatic feature selection for protein function prediction. Proceedings of the International Joint Conference on Neural Networks, Rio de Janeiro, Brazil, 2018.
153. Ahmed I, Witbooi P, Christoffels A. Prediction of human-Bacillus anthracis protein-protein interactions using multi-layer neural network. Bioinformatics. 2018;34:4159-4164.
154. Ruepp A, Zollner A, Maier D, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004;32:5539-5545.
155. Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Enzyme Nomenclature: Recommendations on the Nomenclature and Classification of Enzymes. San Diego, CA: Elsevier; 1992.
156. Zhou N, Jiang Y, Bergquist TR, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. bioRxiv; 2019. https://www.biorxiv.org/content/early/2019/05/29/653105.
157. Wu J, Zhu W, Jiang Y, Sun G, Gao Y. Predicting protein functions of bacteria genomes via multi-instance multi-label active learning. Proceedings of the IEEE International Conference on Integrated Circuits and Microsystems, Shanghai, China, 2018. p. 302-307.
158. Tung CH, Chen CW, Sun HH, Chu YW. Predicting human protein subcellular localization by heterogeneous and comprehensive approaches. PLoS One. 2017;12:e0178832.
159. Youngs N, Penfold-Brown D, Bonneau R, Shasha D. Negative example selection for protein function prediction: the NoGO database. PLoS Comput Biol. 2014;10:e1003644.
160. Bergstra J, Yamins D, Cox DD. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Proceedings of the 30th International Conference on Machine Learning, 2013. p. I-115-I-123. http://dl.acm.org/citation.cfm?id=3042817.3042832.

How to cite this article: Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins. 2020;88:397-413. https://doi.org/10.1002/prot.25832