RESEARCH ARTICLE

cnnAlpha: Protein disordered regions prediction by reduced amino acid alphabets and convolutional neural networks

Mauricio Oberti1,2 | Iosif I. Vaisman1

1 School of Systems Biology, George Mason University, Manassas, Virginia
2 Novartis Institutes for BioMedical Research, Cambridge, Massachusetts

Correspondence
Iosif I. Vaisman, School of Systems Biology, George Mason University, 10900 University Blvd, MS 5B3, Manassas, VA 20110.
Email: ivaisman@gmu.edu

Abstract
Intrinsically disordered regions (IDR) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio, sequence-only prediction method based on reduced amino acid alphabets and convolutional neural networks (CNNs) that aims to overcome the challenge of accurate IDR prediction. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Our method is therefore suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.

KEYWORDS
convolutional neural networks, disordered proteins, machine learning

1 | INTRODUCTION

Intrinsically disordered proteins (IDP) or intrinsically disordered regions (IDR) are segments within a protein chain lacking a stable three-dimensional structure under normal physiological conditions. They have been known to scientists for over 50 years and have since been linked to key biological processes including regulation of transcription, signal transduction, cell cycle control, post-translational modifications, ligand binding, protein interaction, and alternative splicing.1,2 Disordered regions may comprise up to half of the amino acids in eukaryotic proteins.3 At least 6% of all residues in SwissProt are believed to be within disordered regions.4

Experimental structure resolution of IDPs/IDRs is complex, lengthy, and expensive. The DisProt database,5 a community resource annotating protein sequences for intrinsically disordered regions, currently contains just over 800 proteins. Because of this inherent complexity, a large number of computational prediction methods have been developed.6,7 Existing methods can be classified into one of the following categories8: (i) Ab initio or sequence based. These rely almost exclusively on amino acid sequence information to make a prediction. Features extracted from the primary sequence, alignment profiles, or scoring matrices are used as input for statistical models, which then predict disordered regions.
Generally, methods that do not rely on complex external sources of information fall into this category and are referred to as sequence-only. (ii) Clustering. This approach generates tertiary structure models from the primary sequence and then superimposes the models onto each other, under the assumption that positions in ordered regions will be conserved across models. (iii) Template based. Similar to clustering, template-based methods predict disordered regions by aligning the input sequence to homologous proteins with a known structure. Homologous proteins are found by a database search or by fold recognition methods. (iv) Meta or consensus. These combine the output of several disorder predictors into a single average, which tends to yield a moderate increase in accuracy.

Evolutionary information contained in sequence profiles helps ab initio methods improve prediction accuracy. However, generating sequence profiles is time consuming, and methods relying on them may not be suitable for large proteome-wide analyses. This article presents a sequence-only ab initio method for predicting protein disorder based on reduced amino acid alphabets and convolutional neural networks (cnnAlpha). Our method relies solely on the amino acid sequence to determine disordered positions and is aimed at proteome-wide applications where speed and a low false positive rate are prioritized over maximum accuracy.9

Among the main challenges with sequence-based prediction methods are (a) the highly imbalanced class distribution of the datasets and (b) the difficulty of accurately capturing the interdependency of adjacent residues in determining the transitions between disordered and ordered states. If not addressed, class imbalance can severely bias predictions toward the majority class (ordered state). To solve the imbalance problem, we chose an undersampling technique in which we randomly remove examples from the majority class until the dataset is balanced (a brief sketch follows this section). Undersampling has proven highly successful, yielding positive performance in the context of convolutional networks and datasets with extreme imbalance ratios.10 To capture local sequence context, we use a sliding window approach feeding into a convolutional neural network tasked with learning rich higher-order sequence features.

Convolutional neural networks have proven very efficient and well performing in the field of computer vision, excelling in tasks such as object detection and image classification.11 The adaptation of convolutional neural network architectures to biological problems has been successful in the context of DNA-protein binding prediction12 and DNA function modeling.10 Reducing the amino acid alphabet from 20 to 3 letters enables a seamless adaptation of convolutional neural networks to protein models. Instead of analyzing 2-D images with three color channels (R, G, B), fixed-length protein sequence windows are mapped to 1-D input vectors with three channels. This translation maps the protein disorder prediction problem to the 2-class image classification problem in the computer vision domain.
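As an illustration, the undersampling step amounts to discarding randomly chosen majority-class (ordered) examples until both classes are equally represented. The sketch below is ours, not the published implementation: the names `X` (encoded feature windows), `y` (binary labels), and the fixed seed are illustrative placeholders.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop ordered (majority, y == 0) examples until the
    dataset is balanced 50/50, as described in the text."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)                      # disordered residues
    neg = np.flatnonzero(y == 0)                      # ordered residues
    keep = rng.choice(neg, size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, keep]))
    return X[idx], y[idx]
```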
2 | METHODS AND MATERIALS

2.1 | Disorder definition and feature extraction

There is no universal agreement on how to define disordered residues from PDB files.13 In the context of this work, we consider a residue to be in a disordered position if it appears in the sequence records but its coordinates are missing from the electron density map. We annotated our PDB training and CAMEO validation sets using this definition. The annotation provided by the CASP experiments14 was created using a similar definition. This is not a perfect definition, since there are other reasons why a residue can have missing coordinates (e.g., crystallization artifacts). However, it allows us to use a large number of proteins from PDB without further experimental validation.

The primary sequences from our training set had to be translated into numerical features to be fed into the convolutional network. For that purpose, we implemented a 101-residue sliding window centered on the target residue. The window length was set after experimenting with different sizes, finding that larger windows were more consistent in capturing disorder information. For each window, residues are represented by letters from the reduced amino acid alphabet and encoded using a one-hot encoding scheme. This generates a [3 × 101] input feature matrix (three channels by window length) per target residue. This process is illustrated in Figure 1.

[FIGURE 1 Sequence encoding, window generation, and feature extraction steps using the sliding window approach]

2.2 | Reduced alphabets

Reduced alphabets cluster residues in ways that prevent the loss of key biochemical information. The 20-letter amino acid alphabet was reduced to a 3-letter alphabet in order to simplify and speed up the network learning process, reducing the number of possible encodings and the size of the input feature vectors. The reduced alphabets were selected from the literature (Table 1), where each was designed with a specific structural protein task in mind. In each alphabet, residues are clustered based on various properties, including chemical and genetic properties.

TABLE 1 Six reduced alphabets and their sources

  Alphabet reference   Letter 1 (B)   Letter 2 (J)   Letter 3 (U)
  a1 (ref 15)          CFILMVWY       AGHPRT         DEKNQS
  a2 (ref 16)          CFILMVWY       AGPST          DEHKNQR
  a3 (ref 17)          AFGILMPV       DEKR           CHNQSTWY
  a4 (ref 18)          DHIMNVY        EFKLQ          ACGPRSTW
  a5 (ref 19)          ACGILMPSTV     EKRDNQH        FYW
  a6 (ref 20)          CFILMVWY       AGHST          DEKNPQR

Note: Each letter contains a cluster of amino acid residues (one-letter abbreviations). The residue clusters are denoted by the letters "B", "J", and "U".

We found that 3-letter alphabets provide a reasonable balance between limiting the complexity of the sequence space and maintaining the model's ability to efficiently predict disordered residues. Higher-order alphabets (in particular, those between 4 and 10 letters) better characterize the complexity of proteins.16 In our case, their use increases the number of trainable parameters and the complexity of the network model, and requires larger training sets to converge. This is in part supported by the results in Table 4, where the performance of the model without alphabet reduction is consistently below that of the models using a reduction step. A comprehensive search of published alphabets and groupings is beyond the scope of this work and might be addressed in future studies.

Alphabets 1, 2, and 6 performed better in our specific classification task. Alphabet 1 achieves the reduction by minimizing the mismatch between the reduced interaction matrix and the Miyazawa and Jernigan (MJ) matrix. Alphabet 2 identifies the reduced alphabet whose simplified sequences perform best in the context of protein fold recognition using global sequence alignments with the parent sequence. Alphabet 6 implements an automated reduction protocol using information theory metrics tailored to the prediction of solvent accessibility.
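To make the encoding of Section 2.1 concrete, the sketch below reduces a sequence with alphabet a6 (Table 1) and produces one [3 × 101] one-hot matrix per residue. The residue grouping follows Table 1; the paper does not specify how windows are padded at the sequence termini, so the all-zero columns used here are an assumption, and the helper name `encode_windows` is ours.

```python
import numpy as np

# Alphabet a6 from Table 1: three residue clusters mapped to channels B, J, U.
A6 = {**{r: 0 for r in "CFILMVWY"},   # B
      **{r: 1 for r in "AGHST"},      # J
      **{r: 2 for r in "DEKNPQR"}}    # U

def encode_windows(seq, win=101):
    """Yield a [3 x win] one-hot matrix for each residue, with the window
    centered on the target residue. Positions falling outside the sequence
    are left as all-zero columns (assumed padding)."""
    channels = [A6[r] for r in seq]
    half = win // 2
    for i in range(len(seq)):
        m = np.zeros((3, win), dtype=np.float32)
        for j in range(max(0, i - half), min(len(seq), i + half + 1)):
            m[channels[j], j - i + half] = 1.0
        yield m
```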
2.3 | Convolutional neural network architectures

The convolutional neural network architectures used in our models are variations of the one shown in Figure 2. The input is a 3 × L matrix, where L is the length of the sequence window (101 residues). Each symbol of the 3-letter reduced alphabet is mapped to one of three one-hot encoded vectors (B = [0,0,1], J = [0,1,0], U = [1,0,0]). The first layer of our network is a convolutional layer with step size 1 and window size 32; the output of each of its neurons is the convolution of the kernel matrix with the input. The second layer is a max-pooling layer, one for each convolutional layer; each max-pooling layer outputs only the maximum value (global or local) of its respective convolutional layer outputs. The third layer is a fully connected layer of size 256, where each neuron is connected to all of the neurons in the max-pooling layer. We use a dropout layer21 after the fully connected layer to avoid overfitting. The final output layer consists of two neurons corresponding to the two classification results. These two neurons are fully connected to the previous layer. Table 2 highlights the differences among the tested models.

[FIGURE 2 Basic 1-layer CNN architecture shared among all models]

TABLE 2 Description of the CNN architectures tested

  Method          Architecture description
  64-ker-local    1 convolutional layer, 64 kernels, local max pooling
  128-ker-local   1 convolutional layer, 128 kernels, local max pooling
  64-ker-global   1 convolutional layer, 64 kernels, global max pooling
  128-ker-global  1 convolutional layer, 128 kernels, global max pooling
  2-conv-local    2 convolutional layers, [64, 32] kernels, local max pooling

2.4 | Network training details

We train our models using stochastic gradient descent (SGD) with mini-batches of size 128. SGD uses the chain rule to take the partial derivative of the loss function with respect to each weight vector in the network, and uses the derivatives to update the weights. We use a version of SGD with support for momentum and learning rate decay, with default parameters and the learning rate set to 1e-3. All models are trained using the same setup and configuration, the only difference being the seeds for initializing the weights. We use early stopping based on the validation set to pick the optimal set of weights. We train all of our neural network models on AWS G3 instances (NVIDIA Tesla M60 GPU) using the Python Keras library22 running on top of TensorFlow to ensure model portability.
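A minimal Keras sketch of this setup is shown below, wired for the single-convolutional-layer, local-pooling variant. The convolution width (32), dense size (256), batch size (128), optimizer, and learning rate follow the text; activation functions, the pooling width, the dropout rate, the momentum value, and the loss function are not stated in the paper and are filled in with common defaults.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Flatten, MaxPooling1D
from tensorflow.keras.optimizers import SGD

model = Sequential([
    Input(shape=(101, 3)),                  # 101-residue window, 3 channels
    Conv1D(128, kernel_size=32, strides=1,  # convolution, window size 32, step 1
           activation="relu"),              # activation assumed
    MaxPooling1D(pool_size=2),              # local max pooling (width assumed)
    Flatten(),
    Dense(256, activation="relu"),          # fully connected layer of size 256
    Dropout(0.5),                           # dropout rate assumed
    Dense(2, activation="softmax"),         # two output neurons (order/disorder)
])
model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9),
              loss="categorical_crossentropy")  # loss assumed

# Early stopping on the validation set picks the optimal weights:
# model.fit(X_train, y_train, batch_size=128, epochs=100,
#           validation_data=(X_val, y_val),
#           callbacks=[EarlyStopping(restore_best_weights=True)])
```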
3 | RESULTS

3.1 | Training, validation, and evaluation datasets

Publicly available datasets are used to train, validate, and evaluate the performance of our method. High-resolution X-ray crystal structures from the Protein Data Bank (PDB)23 are used to construct the training and validation datasets, while CASP14 and CAMEO24 (http://www.cameo3d.org) targets are used for further validation. Figure 3 shows the protein length distribution for the training, test, and validation sets, and Table 3 shows the distribution of disordered region lengths across the three datasets. We use the PISCES protein sequence culling server (http://dunbrack.fccc.edu)25 to extract sequences from PDB, filter for high resolution, and reduce redundancy. The parameters selected for culling are: (i) proteins sharing less than 25% sequence identity, (ii) resolution better than 1.8 Angstroms, and (iii) R value up to 0.30. In total, 7119 proteins are retrieved from PDB, with an average length of 349 residues. The original dataset is then undersampled to create a 50/50 class-balanced set containing 181 060 examples. The effect of class imbalance is very detrimental to classification performance. In cases of an extreme imbalance ratio, undersampling has been shown to perform on a par with oversampling without the risk of overfitting.26 Undersampling has the additional advantage of reducing training times, given that the training set is smaller.

[FIGURE 3 Protein length distribution in training, test and validation sets]

TABLE 3 Distribution of disordered regions by length on the three main datasets used

  Dataset   Number of fragments
            1-5    6-15   16-25   >25
  CASP10    21     41     11      3
  CAMEO     143    114    27      11
  PDB       768    657    127     37

The balanced dataset was randomly partitioned into ten equally sized subsets, and ten-fold cross-validation was performed to determine the optimal parameters for (a) the convolutional network architecture and (b) the encoding reduced protein alphabet (Section 3.6). At each step of the cross-validation, one subset is selected as the validation set while the remaining nine are used as the training set. This process is repeated until all subsets have been validated; results for each of the parameters tested are shown in Tables 4 and 5.

CASP10 is the latest experiment in the CASP series that released specific targets for protein disorder prediction. The 94 available targets are used for initial validation and as an independent benchmark set. Finally, to further assess and compare our method, we tested it against the CAMEO 6-month targets released from August 26, 2017 to February 18, 2018 (504 targets, categorized in three groups). Since the CAMEO targets were released after the construction of our PDB training set, there is no sequence overlap between the two sets. However, CASP10 targets were already present in PDB at the time of extraction. To prevent any redundancy between sets, we used BLASTClust27 to filter out sequences from the PDB training set sharing at least 25% identity with sequences in the CASP10 set.

3.2 | Metrics and evaluation criteria

Disorder data are characterized by high class imbalance; disordered residues account for less than 5% of the data in the PDB set (training and test). Since disordered residues are relatively rare compared to ordered ones, they are harder to predict. Performance metrics should account for this imbalance and reward correct prediction of disordered residues more than correct prediction of ordered ones.28 We selected a subset of the metrics commonly used for the assessment of disorder predictions14,29,30 that take into account the imbalanced nature of the data: (i) specificity, (ii) sensitivity, (iii) balanced accuracy, (iv) Matthews correlation coefficient, and (v) AUC.

3.3 | Binary metrics

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (1)$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{BalancedAcc} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \qquad (3)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (4)$$

True positives (TP) and true negatives (TN) are the numbers of correctly predicted disordered and ordered residues, respectively. False positives (FP) and false negatives (FN) are the numbers of incorrectly predicted disordered and ordered residues.
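For reference, the four binary metrics can be computed directly from the confusion matrix counts; the sketch below is a straightforward transcription of Equations (1) to (4), with the zero-denominator guard for MCC added as a common convention.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Equations (1)-(4); disordered residues are the positive class."""
    spec = tn / (tn + fp)                     # (1) specificity
    sens = tp / (tp + fn)                     # (2) sensitivity
    bacc = 0.5 * (sens + spec)                # (3) balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # (4) MCC
    return {"Spec": spec, "Sens": sens, "B.Acc": bacc, "MCC": mcc}
```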
3.4 | Statistical metrics

The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate over a range of threshold values for a binary classifier. The ROC curve represents a monotonic function describing the balance between the true positive and false positive rates of a predictor.31 For a set of probability thresholds (from 0 to 1), a residue is considered a positive example (disordered) if its predicted probability is equal to or greater than the threshold value. The area under the curve (AUC) is used as an aggregate measure of the overall quality of a prediction method. The AUC has a minimum value of 0, a random value of 0.5, and a perfect value of 1.

3.5 | Comparison with other methods

To benchmark our method, we selected the following methods: Espritz,32 Disopred3,33 IUPred,34 and ngramsAlpha.35 Given that our predictor is sequence based, we compared our results with similar methods and left out clustering, template, and meta based approaches. Espritz is an ensemble of sequence-only and multiple-sequence-alignment disorder prediction methods. The sequence-only method has three different versions, depending on the initial set used for training (X-ray, NMR, DisProt). We used the X-ray-trained version, since it performs best among the three. Disopred3 runs a PSI-BLAST search for each of the residues in a 15-residue window. The profile is then used as input to a neural network classifier, which outputs a probability estimate of the residue being disordered. The IUPred method is based on estimating the capacity of polypeptides to form stabilizing contacts. It has two prediction modes, IUPred (Long) and IUPred (Short), each optimizing predictions for long or short disordered regions, respectively. Finally, ngramsAlpha is our previously published predictor based on n-gram frequencies and reduced protein alphabets.

3.6 | Parameter and model selection

In order to select the best performing model, we experimented with two of the components of our method while keeping the remaining parameters constant. In particular, we tested several network architectures and reduced amino acid alphabets and analyzed their effect on the model's predictive value. We performed ten-fold cross-validation, using the mean AUC across validation batches as the primary metric to compare performance (sketched below, after Section 3.7). Values for parameters such as the dropout and learning rates, the optimizer, and the window size were selected after performing a hyperparameter search on a reduced-size training set and were then held constant.

3.7 | Alphabet selection

Using reduced alphabets has two main advantages: they (i) cluster residues with similar biochemical properties, providing additional information beyond the original sequence, and (ii) reduce the amino acid space from 20 to 3 letters, reducing, in turn, the model complexity and the amount of data required for training. We tested six different alphabets from the literature and analyzed which performed best in the context of our classification problem. We used the 2-conv-local network architecture across all runs. A modified version of the network using the full amino acid alphabet as input (no alphabet reduction step) is included for comparison. The effect of alphabet selection is shown in Table 4.
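The selection protocol of Section 3.6 reduces to the loop sketched below: ten folds, one held out per step, with the mean validation AUC as the comparison score. This is an illustrative reconstruction, not the published code: `build_model` stands in for any of the architecture variants, labels are assumed one-hot, and the epoch count is a placeholder.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def cross_validate_auc(build_model, X, y, folds=10, seed=0):
    """Mean AUC across the ten validation batches (cf. Tables 4 and 5).
    y is assumed one-hot encoded with shape (n, 2)."""
    aucs = []
    for train_idx, val_idx in KFold(folds, shuffle=True,
                                    random_state=seed).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], batch_size=128,
                  epochs=10, verbose=0)          # epoch count illustrative
        p = model.predict(X[val_idx])[:, 1]      # P(disordered)
        aucs.append(roc_auc_score(y[val_idx][:, 1], p))
    return float(np.mean(aucs))
```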
Across the ten validation batches, we found that alphabets 1, 2, and 6 achieved better overall performance than alphabets 3, 4, and 5. The results also show that all six alphabets outperformed the model where no alphabet reduction was applied. We selected alphabet 6 for our final model implementation based on the results shown in Table 4.

TABLE 4 Alphabet cross-validation

  Alphabet     AUC value of 10 cross-validation batch datasets
               1       2       3       4       5       6       7       8       9       10      Mean
  Alphabet 1   87.55%  89.26%  86.47%  88.55%  88.85%  87.23%  87.44%  87.53%  87.54%  87.71%  87.84%
  Alphabet 2   87.79%  88.86%  87.62%  89.31%  88.50%  87.17%  87.49%  88.01%  88.01%  87.83%  88.09%
  Alphabet 3   83.00%  86.32%  82.87%  86.63%  83.92%  82.89%  83.59%  84.39%  84.54%  84.69%  84.43%
  Alphabet 4   81.87%  86.08%  83.93%  86.22%  85.07%  81.42%  84.18%  83.84%  83.87%  85.12%  84.41%
  Alphabet 5   85.29%  87.10%  84.07%  87.44%  85.12%  83.85%  86.00%  84.84%  85.32%  85.73%  85.50%
  Alphabet 6   87.39%  89.51%  87.48%  89.02%  88.92%  87.54%  87.55%  87.66%  87.66%  87.99%  88.15%
  No alphabet  82.15%  85.52%  82.96%  86.01%  83.09%  81.70%  82.94%  82.69%  83.60%  83.54%  83.42%

Note: Bold values highlight the best performing run/method within the column or row.

Disordered regions are characterized by a high content of polar and charged amino acids (disorder-promoting residues) and a low content of hydrophobic residues (order-promoting residues).36 Despite being created with different objectives, alphabets 1, 2, and 6 cluster most disorder-promoting residues within the same group (Table 1). The alphabets differ in the composition of the other two groups, which contain a mix of order-promoting and ambiguous residues. The relationship between net charge and hydrophobicity has been explored by other IDP predictors before.37 It is the patterns and higher-order relationships between residue groups uncovered by the convolutional step that enable our method to achieve its high accuracy.

It may be possible for a neural network to learn the optimal residue groups given sufficient training data. Given our limited training set and network architecture, this was not possible to achieve. The adapted model that takes the original 20-letter alphabet as input did not converge and underperformed compared with the models using the reduction step (Table 4). These results highlight the benefit of the dimensionality reduction step before training our models.

3.8 | Convolutional network architecture

To test the relationship between network architecture and performance, we trained five different network models and evaluated their predictive value. We adapted models successfully used in the DNA space to predict DNA-protein binding and function,10,12 expecting they would also perform well in the 3-letter reduced amino acid space. Our models differ in the number of kernels (50, 64, 128), the number of convolutional layers (1, 2), and the max-pooling implementation (global vs local; illustrated below).
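The global-versus-local pooling distinction matters here: global max pooling keeps a single maximum per kernel and discards where along the window a motif fired, while local pooling keeps one maximum per small region of the feature map and so preserves coarse positional information. A minimal numeric illustration (our own, with an arbitrary toy feature map):

```python
import numpy as np

feat = np.array([0.2, 0.9, 0.1, 0.4, 0.8, 0.3])  # one kernel's feature map

global_max = feat.max()                # 0.9: strongest match, position lost
local_max = feat.reshape(3, 2).max(1)  # [0.9, 0.4, 0.8]: one max per window
                                       # of 2, coarse position preserved
```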
We found that the number of convolutional layers does not seem to have a great impact on performance. Models with a higher number of convolution kernels and a local pooling implementation achieved better overall classification performance. Based on the results shown in Table 5, we selected the 128-ker-local model.

TABLE 5 Model cross-validation

  Model           AUC value of 10 cross-validation batch datasets
                  1       2       3       4       5       6       7       8       9       10      Mean
  64-ker-local    88.18%  89.59%  87.64%  89.63%  88.42%  87.48%  87.96%  88.02%  88.29%  87.98%  88.32%
  128-ker-local   88.37%  89.56%  87.78%  89.60%  88.44%  87.67%  87.83%  88.25%  88.40%  87.97%  88.39%
  64-ker-global   87.51%  88.24%  85.83%  89.00%  87.47%  86.38%  86.61%  87.02%  87.34%  86.80%  87.22%
  128-ker-global  87.40%  88.52%  86.42%  89.13%  87.67%  85.89%  86.75%  87.61%  87.02%  86.82%  87.32%
  2-conv-local    87.93%  89.11%  87.33%  89.28%  88.47%  87.40%  87.86%  87.80%  87.89%  87.76%  88.08%

Note: Bold values highlight the best performing run/method within the column or row.

3.9 | Method performance

Figures 4 and 5 and Tables 6 and 7 compare the performance of our method against Disopred3, Espritz, IUPred, and ngramsAlpha. Of the listed methods, Disopred3 is the only one that makes use of additional evolutionary information through sequence profiles (performing PSI-BLAST38 searches for each input protein). This added evolutionary information gives the method an extra advantage in performance but comes at the cost of execution time. The other three methods are similar in nature to ours, using sequence-only information to make disorder/order predictions. All methods were downloaded and run locally on a Linux server using default parameters.

In terms of balanced accuracy (B.Acc), our method outperforms all others on the two independent validation datasets. With respect to the area under the ROC curve (AUC) and MCC, our method performs much better than the predictors not using sequence profiles (such as IUPred and Espritz) and nears the performance of Disopred3 for AUC on both validation sets.

[FIGURE 4 ROC curves for the evaluation set targets comparing the performance of the top four models (CASP targets)]

[FIGURE 5 ROC curves for the evaluation set targets comparing the performance of the top four models (CAMEO hard targets)]

TABLE 6 Performance of predictors on CASP10 dataset

  Method          Sequence profile   B.Acc   Sens   Spec   MCC    AUC
  Disopred3       Yes                0.64    0.32   0.97   0.32   0.86
  cnnAlpha        No                 0.75    0.64   0.85   0.31   0.85
  Espritz         No                 0.72    0.54   0.89   0.30   0.82
  ngramsAlpha     No                 0.72    0.61   0.83   0.26   0.79
  IUPred (short)  No                 0.63    0.31   0.95   0.26   0.66
  IUPred (long)   No                 0.57    0.17   0.96   0.15   0.60

Note: Metrics shown: balanced accuracy (B.Acc), sensitivity (Sens), specificity (Spec), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC).

TABLE 7 Performance of predictors on CAMEO dataset

  Method          Sequence profile   B.Acc   Sens   Spec   MCC    AUC
  Disopred3       Yes                0.72    0.48   0.96   0.43   0.86
  cnnAlpha        No                 0.75    0.61   0.88   0.36   0.83
  Espritz         No                 0.75    0.64   0.88   0.35   0.81
  ngramsAlpha     No                 0.73    0.56   0.89   0.33   0.79
  IUPred (short)  No                 0.71    0.47   0.94   0.36   0.80
  IUPred (long)   No                 0.64    0.35   0.93   0.27   0.73

Note: Metrics shown: balanced accuracy (B.Acc), sensitivity (Sens), specificity (Spec), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC).

The performance of the method was also evaluated on disordered regions of various lengths in the CASP10 dataset and compared with the other top performing methods. The percentage of residues correctly predicted to be disordered is reported in Table 8. While Espritz performs better on short disordered regions, Disopred3 and cnnAlpha achieve better results on mid-length and long disordered regions.

TABLE 8 Predictors' recall by region length in CASP10

  Method      <10 AA   10-30 AA   >30 AA
  cnnAlpha    0.40     0.42       0.46
  Espritz     0.43     0.39       0.33
  Disopred3   0.26     0.32       0.47

Note: Bold values highlight the best performing run/method within the column or row.

3.10 | Large-scale predictions

Finally, we evaluated the speed at which our method performs predictions on a large scale. We compared our method's execution time with that of Disopred3, since they both ranked at the top of our evaluation. The two applications were installed locally on a standard Linux server (Amazon EC2 m5.xlarge, 4 CPUs/16 GB memory). To make predictions, Disopred3 uses PSSM values obtained after three search iterations of PSI-BLAST.33 The BLAST tool and the UniRef90 database were installed locally for that purpose. We created a script that takes as input parameters the method name and a list of target protein FASTA files, performs predictions, and saves the results to an output file.
We timed the execution of each script run via the Linux time command. The execution time needed to perform predictions on the CASP10 dataset (94 proteins, 25 370 residues) is reported in Table 9.

TABLE 9 Execution times on the CASP10 dataset by predictor

  Method      Total time   Average time per protein
  Disopred3   39 464 s     424 s
  cnnAlpha    34 s         0.37 s

The large difference in execution time is explained by the fact that extracting features and performing a forward pass through a previously trained neural network is extremely fast compared to running multiple PSI-BLAST searches. That makes our method several orders of magnitude faster than Disopred3 while still achieving similar accuracy.
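For context, per-protein inference in this setting amounts to a single batched forward pass over pre-encoded windows. The sketch below assumes the hypothetical `encode_windows` helper from the Section 2 sketch and a previously trained, saved Keras model; the file name, test sequence, and batch size placement are all placeholders, not the published tooling.

```python
import time
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("cnn_alpha.h5")   # placeholder path for a trained model

def predict_disorder(seq):
    """Per-residue disorder probabilities from one batched forward pass."""
    X = np.stack([w.T for w in encode_windows(seq)])   # (len(seq), 101, 3)
    return model.predict(X, batch_size=128)[:, 1]      # P(disordered)

start = time.time()
probs = predict_disorder("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(f"{len(probs)} residues scored in {time.time() - start:.2f} s")
```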
4 | DISCUSSION

This paper presents cnnAlpha, a new convolutional neural network-based method for protein disorder prediction using sequence information. We demonstrated that our combination of an amino acid alphabet reduction strategy and convolutional neural networks leads to an approach that successfully competes with more elaborate and computationally expensive sequence-based algorithms. The source code for an R/Shiny application with the model implementation of our predictor can be found at https://github.com/mauricioob/shiny-pred.

CNNs are good at learning rich higher-order sequence features, such as secondary motifs and local sequence context. We believe that the reduction from a 20-letter to a 3-letter amino acid alphabet helped the convolutional layer to better detect these relationships and patterns. The reduction in dimensionality and our undersampling approach to the class imbalance problem have the additional advantage of reducing the amount of data required for the training sets. This, in turn, made our models faster to train and allowed us further experimentation with parameter settings.

Overall, our method outperforms similar sequence-only algorithms across both evaluation datasets and nears the performance of sequence-based methods that use additional evolutionary information (sequence profiles). Being several orders of magnitude faster than sequence-profile-based approaches, our method is suitable for high-throughput predictions at the proteomic scale. The high specificity of cnnAlpha also ensures a low false positive rate in high-throughput contexts, making it even more suitable for this task.

ACKNOWLEDGMENT
The authors are grateful for the computational facilities provided by Novartis Institutes for BioMedical Research.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/prot.25966

ORCID
Mauricio Oberti https://orcid.org/0000-0001-7107-2616

REFERENCES
1. Dunker AK, Oldfield CJ, Meng J, et al. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics. 2008;9(Suppl 2):S1.
2. Oldfield CJ, Dunker AK. Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem. 2014;83:553-584.
3. Dunker AK, Bondos SE, Huang F, Oldfield CJ. Intrinsically disordered proteins and multicellular organisms. Semin Cell Dev Biol. 2015;37:44-55.
4. Di Domenico T, Walsh I, Martin AJM, Tosatto SCE. MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics. 2012;28(15):2080-2081.
5. Sickmeier M, Hamilton JA, LeGall T, et al. DisProt: the Database of Disordered Proteins. Nucleic Acids Res. 2007;35(suppl 1):D786-D793. http://nar.oxfordjournals.org/content/35/suppl_1/D786.
6. He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK. Predicting intrinsic disorder in proteins: an overview. Cell Res. 2009;19(8):929-949.
7. Deng X, Eickholt J, Cheng J. A comprehensive overview of computational protein disorder prediction methods. Mol BioSyst. 2012;8(1):114-121.
8. Atkins JD, Boateng SY, Sorensen T, McGuffin LJ. Disorder prediction methods, their applicability to different protein targets and their usefulness for guiding experimental studies. Int J Mol Sci. 2015;16(8):19040-19054. http://www.mdpi.com/1422-0067/16/8/19040.
9. Sirota FL, Ooi HS, Gattermayer T, Schneider G, Eisenhaber F, Maurer-Stroh S. Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics. 2010;11(1):S15. https://doi.org/10.1186/1471-2164-11-S1-S15.
10. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44(11):e107. http://nar.oxfordjournals.org/content/early/2016/04/15/nar.gkw226.
11. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444. http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html.
12. Zeng H, Edwards MD, Liu G, Gifford DK. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics. 2016;32(12):i121-i127. http://bioinformatics.oxfordjournals.org/content/32/12/i121.
13. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11(11):1453-1459.
14. Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K. Assessment of protein disorder region predictions in CASP10. Proteins. 2014;82(Suppl 2):127-137. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4406047/.
15. Wang J, Wang W. A computational approach to simplifying the protein folding alphabet. Nat Struct Mol Biol. 1999;6(11):1033-1038. http://www.nature.com/nsmb/journal/v6/n11/full/nsb1199_1033.html.
16. Li T, Fan K, Wang J, Wang W. Reduction of protein sequence complexity by residue grouping. Protein Eng. 2003;16(5):323-330. http://peds.oxfordjournals.org/content/16/5/323.
17. Branden C, Tooze J. Introduction to Protein Structure. New York, NY: Garland Publishing; 1991.
18. Mekler LB. Specific selective interaction between amino acid residues of polypeptide chains. 1969. OCLC: 26411216.
19. Murphy LR, Wallqvist A, Levy RM. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 2000;13(3):149-152. http://peds.oxfordjournals.org/content/13/3/149.
20. Bacardit J, Stout M, Hirst JD, Valencia A, Smith RE, Krasnogor N. Automated alphabet reduction for protein datasets. BMC Bioinform. 2009;10(1):6. http://www.biomedcentral.com/1471-2105/10/6/abstract.
21. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929-1958. http://jmlr.org/papers/v15/srivastava14a.html.
22. Chollet F. Keras. San Francisco, CA: GitHub; 2015. https://github.com/fchollet/keras.
23. Berman HM, Westbrook J, Feng Z, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235-242. http://nar.oxfordjournals.org/content/28/1/235.
24. Haas J, Roth S, Arnold K, et al. The Protein Model Portal: a comprehensive resource for protein structure and model information. Database (Oxford). 2013;2013:bat031.
25. Wang G, Dunbrack RL. PISCES: a protein sequence culling server. Bioinformatics. 2003;19(12):1589-1591.
26. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249-259. http://arxiv.org/abs/1710.05381.
27. NCBI News: Spring 2004 | BLASTLab. https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html.
28. Monastyrskyy B, Fidelis K, Moult J, Tramontano A, Kryshtafovych A. Evaluation of disorder predictions in CASP9. Proteins. 2011;79(S10):107-118. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3212657/.
29. Huang T, He ZS, Cui WR, et al. A sequence-based approach for predicting protein disordered regions. Protein Pept Lett. 2013;20(3):243-248. http://www.eurekaselect.com/openurl/content.php?genre=article&issn=0929-8665&volume=20&issue=3&spage=243.
30. Wang S, Ma J, Xu J. AUCpreD: proteome-level protein disorder prediction by AUC-maximized deep convolutional neural fields. Bioinformatics. 2016;32(17):i672-i679.
31. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861-874. http://www.sciencedirect.com/science/article/pii/S016786550500303X.
32. Walsh I, Martin AJM, Domenico TD, Tosatto SCE. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012;28(4):503-509. http://bioinformatics.oxfordjournals.org/content/28/4/503.
33. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT. The DISOPRED server for the prediction of protein disorder. Bioinformatics. 2004;20(13):2138-2139. http://bioinformatics.oxfordjournals.org/content/20/13/2138.
34. Dosztányi Z, Csizmok V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433-3434.
35. Oberti M, Vaisman II. Identification and prediction of intrinsically disordered regions in proteins using N-grams. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB '17). New York, NY: ACM; 2017:67-72. http://doi.acm.org/10.1145/3107411.3107480.
36. Uversky VN. Intrinsically disordered proteins from A to Z. Int J Biochem Cell Biol. 2011;43(8):1090-1103. http://www.sciencedirect.com/science/article/pii/S1357272511000914.
37. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, et al.
FoldIndex©: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005;21(16):3435-3438. http://bioinformatics.oxfordjournals.org/content/21/16/3435.
38. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389-3402.

How to cite this article: Oberti M, Vaisman II. cnnAlpha: Protein disordered regions prediction by reduced amino acid alphabets and convolutional neural networks. Proteins. 2020;88:1472-1481. https://doi.org/10.1002/prot.25966