R E S E A R C H A R T I C L E A protein sequence fitness function for identifying natural and nonnatural proteins Rahul Kaushik | Kam Y. J. Zhang Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, Yokohama, Kanagawa, Japan Correspondence Kam Y. J. Zhang, Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa 230-0045, Japan. Email: kamzhang@riken.jp Funding information Japan Society for the Promotion of Science, Grant/Award Number: 18H02395 Abstract The infinitesimally small sequence space naturally scouted in the millions of years of evolution suggests that the natural proteins are constrained by some functional prerequisites and should differ from randomly generated sequences. We have developed a protein sequence fitness scoring function that implements sequence and corresponding secondary structural information at tripeptide levels to differentiate natural and nonnatural proteins. The proposed fitness function is extensively validated on a dataset of about 210 000 natural and nonnatural protein sequences and benchmarked with existing methods for differentiating natural and nonnatural proteins. The high sensitivity, specificity, and percentage accuracy (0.81%, 0.95%, and 91% respectively) of the fitness function demonstrates its potential application for sampling the protein sequences with higher probability of mimicking natural proteins. Moreover, the four major classes of proteins (α proteins, β proteins, α/β proteins, and α + β proteins) are separately analyzed and β proteins are found to score slightly lower as compared to other classes. Further, an analysis of about 250 designed proteins (adopted from previously reported cases) helped to define the boundaries for sampling the ideal protein sequences. The protein sequence characterization aided by the proposed fitness function could facilitate the exploration of new perspectives in the design of novel functional proteins. K E Y W O R D S amino acid propensity, protein foldability, computational protein design, natural proteins, protein sequence space, scoring of protein designs 1 | INTRODUCTION Delineating the connection between protein sequence and its structure is one of the most persuasive, debatable, and unresolved affairs in the field of computational structure biology.1-3 In the last two decades, the concepts of delicate contribution of natural selection and the modest alteration by evolution in random copolymers in emergence of known proteins is extensively discussed and argued for its significance in origin of Life.4-8 The specificity of existent natural proteins to encrypt unique protein structures is restricted within a limited number of folds (1457 protein folds) which poses another scientific challenge of quantifying the protein designability of sampled sequences into the existing folds. Creating a protein to perform a predefined or novel biological function(s) is often considered as protein design. In general, protein design is formulated as composing an amino acid sequence which should ideally fold into a stable structure, meant to perform some biological function(s).9-13 The goals of protein design include either desired optimization of certain characteristics such as stability, solubility, and binding affinity or designing entirely novel sequences resulting into novel structures or attaining remedial or industrial utilities.14-17 A primary and a very crucial step of novel protein design involves the computational identification or generation of potential protein sequences having a considerably high probability of mimicking the naturally occurring proteins and eventually folding Received: 15 January 2020 Revised: 26 March 2020 Accepted: 7 May 2020 DOI: 10.1002/prot.25900 Proteins. 2020;88:1271–1284. wileyonlinelibrary.com/journal/prot © 2020 Wiley Periodicals, Inc. 1271 into a compact structure.18,19 Some of the earlier studies measured the degree of randomness in the known protein sequences to explore the logical explanations with conflicting inferences for constrained available sequence space.5,7,8,19-22 Some of the previous computational studies of random sequence proteins argued over the extent of variability among natural protein sequences and random protein sequences.23-25 In protein design regimes, folding into a distinct conformation is foremost requirement. It is believed that the randomly generated and de-novo evolved protein sequences tend to form a molten globule state with marginal secondary structural elements.18,26,27 However, most of the recent approaches fortified with deep learning and artificial intelligence have contributed significantly in classifying the dataset6,7,19,28,29 but without exploring the underlying science. For the 100 residue long protein sequences, the theoretical sequences space of astronomically staggering 10130 proteins (20100 combinations) in contrast to the infinitesimal fraction of naturally existing proteins. For instance, when a nonredundant dataset of all available protein sequences in the UniProt database (22 million sequences, excluding predicted, and uncertain proteins) is analyzed, only 109 unique stretches of 100 residues could be extracted. This huge decline in the number of available compact structures and unique 100 residues polypeptides substantiate the possibility of some underlying protein signatures at sequence level that lend the protein with potential of imitating the natural proteins and fold into stable structure. Some of the previous studies have utilized the concept of neighboring effect of amino acid residues in dictating protein secondary and tertiary structures.30-33 Here, we describe a sequence and secondary structure-based fitness scoring function to identify potentially foldable/designable protein sequences by differentiating them from nonnatural protein sequences. The presented fitness function implements the competency scores derived from sequences and corresponding secondary structures of well-characterized known protein domains (natural proteins, NP) and computationally generated nonnatural protein sequences with natural amino acid compositions (NNP-NC) and with uniform amino acid compositions (NNP-UC). The scoring function classifies a query protein sequence into foldable (natural protein) or non-foldable (nonnatural and/or random protein) depending on its competency scores compared with natural and nonnatural protein sequences. 2 | MATERIALS AND METHODS For the development of the scoring function, the datasets of natural protein (NP) sequences (adopted from known protein domains) and computationally generated nonnatural protein (NNP-NC and NNPUC) sequences are compiled. 2.1 | Dataset compilation The protein sequences and corresponding secondary structures of all known protein domains in the latest stable release of the SCOPe database,34 SCOPe 2.07 are extracted which comprises 274 230 protein domains. These domains are subjected to clustering at 100% sequence identity level using CD-HIT35 to filter out the redundant proteins in the dataset. Post-clustering, resulting 77 280 domains are further screened for the presence of non-standard amino acid residues, missing residues (except for N and C terminals), domains having less than 50 residues, domains having more than 700 residues, or membrane protein domains. These filters resulted in a dataset of 58 758 globular protein domains as depicted in the Figure S1. The 100% sequence identity level filter is used to ensure the inclusion of a maximum number of possible triplets of amino acid residues and corresponding secondary structural elements. However, sequence identity filters at 80%, 60%, and 40% sequence identity levels are also used to explore the possibilities. A significant decline in the available number of combinations of triplets of amino acid residues and corresponding secondary structure is observed. The statistics related to availability of combinations of triplets is shown in Figure S2 and Supplementary Note I. The dataset corresponding to these protein sequences is referred as natural proteins (NP) dataset hereafter, as it is derived from naturally existing known proteins. Similarly, a dataset of 65 000 proteins having sequence length varying from 50 to 700 residues is generated computationally restraining the amino acid compositions adopted from UniProtKB.36 Since the dataset of computationally generated protein sequences is constrained to the same amino acid composition as naturally existing proteins, it is referred to as nonnatural protein dataset with natural distribution of amino acid compositions (NNP-NC). The “makeprotseq” module of EMBOSS37 is used for computationally generating these protein sequences. As a cautionary measure, the NNP-NC dataset is also subjected to clustering at 100% sequence identity level to avoid the sequence redundancy. However, it is observed that these computationally generated sequences did not have any redundancy at 100% sequence identity. 2.2 | Extraction of secondary structural information For the selected natural protein (NP) dataset, the secondary structural information at an individual residue level for each protein is extracted using the standalone version of STRIDE secondary structure assignment program.38 The 8-class secondary structure assignment of STRIDE is converted into 3-class secondary structure assignment for further processing. In this conversion, the 310 helices (G), π-helices (I), and 4-turn helices (H) are grouped as helices, the extended strands in β-sheet conformations (E) and isolated β-bridges (B) are pooled together as strands (E), and the hydrogen bonded turns (T), coils (C) and bends (S) are bundled as loops (C). For secondary structural information corresponding to nonnatural proteins with natural AA compositions (NNP-NC) dataset, secondary structure prediction using standalone version PSIPRED (PSIPRED 4.02) is performed.39 Considering the current state of the art for protein secondary structure prediction, PSIPRED is reported to deliver a reasonably high accuracy 1272 KAUSHIK AND ZHANG and thus used in present study. It is worth noting that the PSIPRED failed to predict any secondary structure for 3862 proteins. These proteins are discarded from any further processing. A sub-dataset of 58 758 proteins is selected from the nonnatural proteins (NNP-NC) dataset (out of 61 138 proteins with predicted secondary structure). This led into a total of 117 516 proteins sequences, comprising 58 758 proteins each in natural proteins (NP) and nonnatural proteins with natural AA compositions (NNP-NC) datasets. To examine the differences in amino acid neighbor preferences in different secondary structures for computationally generated sequences NNP-NC, we derived the conditional probabilities of triplets using natural protein and nonnatural protein (NNP-NC) sequences and corresponding secondary structures. 2.3 | Classifying into reference and test datasets The natural proteins (NP) and nonnatural proteins (NNP-NC) datasets are randomly separated into two parts each as reference dataset of 41 132 proteins and test dataset of 17 626 proteins (reference = 70% and test = 30% of 58 758 proteins). This resulted into four subdatasets, viz. natural proteins reference dataset (comprising 41 132 proteins), natural proteins test dataset (comprising 17 626 proteins), nonnatural proteins (NNP-NC) reference dataset (comprising 41 132 proteins) and nonnatural proteins (NNP-NC) test dataset (comprising 17 626 proteins). The reference datasets are used for deriving a conditional probability-based statistical model, leading to competency scores of tripeptides and the test datasets are used for testing the efficiency of competency scores in distinguishing the natural protein (NP) and nonnatural protein (NNP-NC) sequences. 2.4 | Compiling sequence-based scoring libraries For all the protein sequences in the natural proteins reference dataset, tripeptides frequencies are calculated for all possible 8000 combinations. Also, individual amino acid residues occurrence frequencies are calculated from natural proteins reference dataset. It may be noted that the natural protein reference dataset represents all the possible combinations at tripeptides level sufficiently, encompassing more than 8 million tripeptides (depicted in Figure S3). The residue occurrence frequencies and tripeptide frequencies are further used for calculating tripeptide conditional probabilities using Equation (1). Notably, the conditional probability calculated in Equation (1) considers forward (C-terminal) and backward (N-terminal residue) neighborhoods of the central residue. Also, this consideration takes care of directionality in the tripeptides as P(YM|XNZC) is not same as P(YM|ZNXC). So, it may be considered that the conditional probabilities of tripeptides calculated in Equation (1) is inclusive of their residue-based adjacency and directionality statistics. P YMjXNZCð Þ = P XYZð Þ P Yð Þ , ð1Þ where X, Y, and Z belong to any of the standard amino acid residues; P(YM|XNZC) is the conditional probability of residue “Y”, given a residue “X” on its N-terminal and a residue “Z” on its C-terminal; P(XYZ) is the probability of tripeptide “XYZ”; and P(Y) is the probability of residue “Y”. The conditional probabilities of all tripeptides as calculated using Equation (1) are further used to compute a percentage sequence competency score (CS-Score) at individual residue level by normalizing the conditional probabilities with the maximum conditional probability in all combinations of tripeptides. The CS-Score is calculated for the middle residue in a tripeptide considering one adjacent residue on its either side (one toward N-terminal and one toward C-terminal) using Equation (2). CS−score XNYMZCð Þ = 100 P YMjXNZCð Þ Pmax AAMjAANAACð Þ   , ð2Þ where CS-score (XNYM ZC) is the competency score of middle residue “Y” given residues X and Z at its N-terminal and C-terminal, respectively; P(YM|XNZC) is the conditional probability of residue “Y”, given a residue “X” on its N-terminal and a residue “Z” on its C-terminal (as computed in Equation (1)); and Pmax (AAM|AANAAC) is the maximum conditional probability in all 8000 tripeptides. The CS-Scores derived from Equation (2) resulted in 400 values for an individual residue, accounting for the occurrence of any of the 20 amino acid residues on either side. The overall flow of computation of CS-Scores is depicted in Figure 1A,B, with an example tripeptide, Lys-Ala-Met. These scores are used to evaluate the overall competence of protein sequences as discussed in results section. 2.5 | Compiling sequence and secondary structure based scoring libraries As mentioned above, the secondary structural information at 3-class levels is compiled from natural proteins dataset. The tripeptide frequencies along with corresponding secondary structure assignments (Helix (H), Strand (E), and Coils (C)) are derived from the natural proteins reference datasets for all possible combinations, that is, 216 000 combinations (203 × 33 ). It may be noted that all the possible combinations could not be observed in the natural protein reference dataset as seven out of 27 secondary structure combinations are practically not possible, viz. HEH, HEC, EHE, EHC, CHC, CHC, CEH. The secondary structure directed tripeptide frequencies are used to derive the probability of each available combination. Also, for all the individual residues with their secondary structure assignment (20 × 3 combinations), probabilities are calculated. The tripeptide and individual residue probabilities are further used for calculating secondary structure directed tripeptides conditional probabilities using Equation (3). P Y Sy M jYSx N YSz C   = P XSx YSy ZSz   P YSy   , ð3Þ KAUSHIK AND ZHANG 1273 where X, Y, and Z belong to any of the standard amino acid residues; Sx, Sy, and Sz belong to any of the three secondary structure assignments (H or E or C); P Y Sy M jYSx N YSz C   is the conditional probability of the middle residue “Y” having secondary structure “Sy”, given a residue “X” having secondary structure “Sx” toward N-terminal and a residue “Z” having secondary structure “Sz” toward C-terminal; P XSx YSy ZSz   is the probability of a tripeptide “XYZ” having secondary structure “SxSySz”; and P YSy   is the probability of middle residue “Y” having secondary structure “Sy”. It may be noted that “S” can assume any of the three secondary structure assignments (H, E, and C) but should be exactly the same for corresponding middle, N-terminal, and C-terminal residues to maintain the forward and backward neighborhood, and directionality of secondary structural triplets. The conditional probabilities calculated in Equation (3) are further used for calculating sequence and secondary structure-based percentage competency score (CSS-Score) at an individual residue level by normalizing the conditional probabilities with the maximum conditional probability in all available combinations of the tripeptides having exactly same secondary structure assignment. The CSS-Score is calculated for the middle residue in a tripeptide with its secondary structure considering one adjacent residue of either side having specific secondary structure assignments as shown in Equation (4). CSS−score XSx N Y Sy M Z Sz C   = 100 P Y Sy M jXSx N ZSz C   Pmax AA Sy M jAASx N AASz C   0 @ 1 A, ð4Þ where CSS−score XSx N Y Sy M Z Sz C   is the sequence and secondary structure-based competency score of the middle residue “Y” having a secondary structure “Sy”, given a residue “X” having a secondary structure “Sx” toward N-terminal and a residue “Z” having a secondary structure “Sz” toward C-terminal; P Y Sy M jXSx N ZSz C   is the conditional probability of the middle residue “Y” having a secondary structure “Sy”, given a residue “X” having a secondary structure “Sx” toward N-terminal and a residue “Z” having a secondary structure “Sz” toward C-terminal; and Pmax AA Sy M jAASx N AASz C   is the maximum conditional probability observed for any of the tripeptides with exactly the same secondary structure for middle, N-terminal, and C-terminal residues. The overall flow of computation of CSS-Scores is depicted in Figure 1C, D with an example tripeptide, Lys-Ala-Met with helices as secondary structure assignment for all the three residues. These scores are used to evaluate the overall competence of natural and nonnatural protein sequences. It is worth mentioning that the CS-Scores and CSS-Scores are nonzero and positive values which may be maximum up to 100. The presently used 100% sequence identity filter ensured inclusion of the maximum number of possible triplets of amino acid residues and corresponding secondary structural elements. The normalization used in the Equations (2) and (4) calculates the score as a ratio of probabilities and removes the statistical bias due to closely related sequences. To investigate it further, all the natural protein sequences are clustered at lower sequence identity filters viz. 80%, 60%, and 40% and CSS-Scores libraries are compiled using Equations (3) and (4). A very high correlation is observed among the CSS-Scores of the triplets derived from the FIGURE 1 The overall flow of compiling scoring libraries. (A) A stepwise outline for calculating sequence-based competency score (CS-Score) of a residue by considering its adjacent residues toward N- and C-terminals. (B) A stepwise depiction for calculation of sequence-based competency score of an example tripeptide (Lys-Ala-Met is considered here). (C) A stepwise outline for calculating sequence and secondary structure-based competency score (CSS-Score) of a residue with a specific secondary structure by considering its adjacent residues toward N- and C-terminals with specific secondary structure assignment. (D) A stepwise depiction for calculation of sequence and secondary structurebased competency score of an example tripeptide (Lys(H)-Ala(H)Met(H) is considered here) 1274 KAUSHIK AND ZHANG natural protein sequences at different sequence identity filters (40% and 100% = 0.95, 60% and 100% = 0.96%, 80% and 100% = 0.96) as shown in Figure S2. Considering the high similarity in CSS-Score libraries and the decline in triplet combinations at different sequence identity filters, it may be posited that the filtering at 100% sequence identity should not impart any bias to the statistics while ensuring the maximum utilization of available information at triplet level. 2.6 | Calculation of competency score for a protein sequence For calculation of overall competency scores of a given protein sequence, the CS- and CSS-Scores of individual residues are used. It may be noted that the first residue (N-terminal residue) and the last residue (C-terminal residue) do not have their individual CS- and CSSScores. The overall CS- and CSS-Scores for a given protein may be calculated as shown in Equations (5) and (6). OverallCS−ScoreProtein = Pi = N−1ð Þ i = 2 CS−Score ið Þ N−2ð Þ , ð5Þ where N is sequence length of the protein for which the overall CSScore is to be calculated, CS-Score(i) is CS-Score of individual residues as calculated in Equation (2). OverallCSS−ScoreProtein = Pi = N−1ð Þ i = 2 CSS−Score ið Þ N−2ð Þ , ð6Þ where N is sequence length of the protein for which the overall CSSScore is to be calculated, CSS-Score(i) is CSS-Score of individual residues as calculated in Equation (4). 2.7 | Competency scores for natural proteins The sequence and sequence and secondary structure scoring libraries (CS-scores and CSS-Scores) are used to calculate overall competency scores for individual sequences in natural proteins reference dataset of 41 132 proteins. The distribution curves for average competency scores in terms of CS-Scores and CSS-Scores are shown in Figure 2 (colored in green). Additionally, the scatter plots of average competency scores are shown in the Figure S4 for better insight. It is observed that the sequence-based competency scores (CS-Scores) averaged at 33.2 ± 3.14 and the sequence and secondary structurebased competency scores (CSS-Scores) averaged at 18.0 ± 3.61 for the reference dataset of natural protein sequences. 2.8 | Competency scores for nonnatural proteins (NNP-NC) For all the computationally generated protein sequences in nonnatural protein (NNP-NC) reference dataset, the overall competency scores for individual sequences are calculated by using tripeptide-based CSScores and CSS-Scores. The distribution curves of CS-Scores and CSS-Scores for nonnatural protein (NNP-NC) sequences are shown in FIGURE 2 The distribution curves of competency scores. (A) Distribution curves for sequence-based competency scores of natural (in green color) and nonnatural (in red color) (CS-Scores). (B) Distribution curve for sequence and secondary structure-based competency scores (CSS-Scores) for natural (in green color) and nonnatural (in red color) proteins in corresponding reference datasets. The CSS-Scores are reflecting a better differentiation of natural and nonnatural proteins as compared to CS- Scores KAUSHIK AND ZHANG 1275 Figure 2 (colored in red). Also, the scatter plots of competency scores of individual proteins in nonnatural proteins (NNP-NC) reference dataset are depicted in the Figure S5. In case of reference dataset of nonnatural protein (NNP-NC) sequences, the observed sequencebased competency scores averaged at 31.7 ± 1.78 and the sequence and secondary structure-based competency scores averaged at 14.6 ± 1.45. In the present study, the secondary structure prediction is used to estimate the likelihood of secondary structure for the nonnatural proteins which is further utilized in performing overall scoring of nonnatural proteins. Further, the method used here for the secondary structure prediction is not exclusively dependent on amino acid substitution matrix (viz. BLOSUM62), it also implements three different neural network weights which are expected to improve the prediction accuracy. 2.9 | Differences in competency scores of natural proteins (NP) and nonnatural proteins (NNP-NC) It is very difficult to conclude directly from the average competency scores for natural proteins (NP) and nonnatural protein sequences (NNP-NC) if these deviates meaningfully. For testing the significance of differences in the average competency scores for natural (NP) and nonnatural protein (NNP-NC) sequences in the reference datasets, z-test of two samples for means is conducted on competency scores of 41 132 natural protein sequences and 41 132 nonnatural protein sequences (41 132 observations each). Based on the outcome of z-test, in case of sequence-based competency scores (CS-Scores), the natural protein sequences (μ = 33.2, σ = 3.14, n = 41 132) and nonnatural protein (NNP-NC) sequences (μ = 31.7, σ = 1.78, n = 41 132) are hypothesized to be different. The difference is very significant, z = 96.73, P = .00 (two-tail). Also, in case of sequence and secondary structure-based competency scores (CSS-Scores), the natural protein sequences (μ = 18.0, σ = 3.61, n = 41 132) and nonnatural protein (NNP-NC) sequences (μ = 14.6, σ = 1.45, n = 41 132) are hypothesized to be different. The difference is very significant, z = 210.75, P = .00 (two-tail). Further details of z-test statistics are provided in the Table S1. To investigate the differences in amino acid neighbor preferences in different secondary structures for computationally generated nonnatural protein sequences (NNP-NC), we derived the conditional probability of triplets using these sequences and corresponding predicted secondary structures. The conditional probabilities of triplets in natural proteins and computationally generated nonnatural proteins showed a correlation of 0.73, which supports the assumption that the computationally generated protein sequences have differences in amino acid neighbor preferences in different secondary structures. These differences in the neighboring preferences may help in computational sampling of protein sequences with higher potential of mimicking the natural proteins. Further, to investigate the possibility of computational bias, instead of their original secondary structures, the predicted secondary structures for the natural proteins are used to recalculate the CSS-Score libraries. It is observed that the CSS-Score libraries computed using predicted secondary structures showed a significant similarity (r = 0.94) with the CSS-Score libraries computed using the experimental secondary structures. Additionally, a z-test is performed to further analyze the differences in the CSS-Score libraries as reported in the supplementary materials (Table S2). The CSS-Score library derived from experimental secondary structures of natural protein (μ = 5.68, σ = 8.00, n = 91 222) and the CSS-Score library derived from predicted secondary structure of natural proteins (μ = 5.69, σ = 8.29, n = 91 222) are hypothesized to be significantly similar (z = −0.39, P = .69 (two-tail)). As the difference is not significant, it may be assumed that in case of sampling and scoring novel proteins, the performance of the proposed scoring function may not change significantly upon using the predicted secondary structures. 2.10 | Efficacy of competency scores The receiver operating characteristic curve (ROC Curve) is one of the most prevalent and extensively instigated statistical tools for assessing the discriminatory efficacy of a given classifier. Here, the average competency scores of the individual proteins at sequence (CS-Score) and sequence and secondary structural (CSS-Score) levels are assessed for their potential to differentiate natural (NP) and nonnatural protein (NNP-NC) sequences. Under the assumption that the higher CS-Score and CSS-Score for a protein are indicative of its imitating behavior as natural proteins and lower CS-Score and CSS-Score for a protein are suggestive of imitating behavior as nonnatural proteins (NNP-NC). At different threshold values of competency scores, different pairs of sensitivity and specificity are derived from reference dataset of natural and nonnatural proteins using Equation (7) as follows. ROC tð Þ = FPR tð Þ,TPR tð Þð Þ,t Range of Competency Scoreð Þf g, ð7Þ where FPR(t) is the false positive rate at a threshold value “t”; TPR(t) is the true positive rate at a threshold value “t”. The ROC curves are plotted with two underlying assumption, (a) the potential of competency scores to identify the natural proteins (NP) and (b) potential of competency scores to identify the nonnatural proteins (NNP-NC). At different thresholds of CS-Scores, ROC(t)Natural, and ROC(t)Nonnatural are calculated and plotted in Figure 3A. Similarly, at different thresholds of CSS-Scores, ROC(t)Natural, and ROC(t)Nonnatural are calculated and plotted in Figure 3B. The threshold values at point of intersection ROC curves of natural and nonnatural proteins are observed to be the optimum cutoff for differentiating natural and nonnatural proteins. The ROC curves of CS-Score for natural proteins (NP) and nonnatural protein (NNP-NC) sequences are intersecting at a threshold value of 32.15. At intersection point, the sensitivity and specificity in identifying natural (NP) and nonnatural proteins (NNP-NC) is 0.62 and 0.63, respectively. However, the Mathews Correlation Coefficient (MCC) at CS-Score cutoff value of 32.15 is 0.26 which indicates weak prediction model for binary classification. The low value of MCC is 1276 KAUSHIK AND ZHANG suggestive of inability of CS-Score in discriminating natural and nonnatural proteins. The ROC curves of CSS-Score for natural (NP) and nonnatural proteins (NNP-NC) showed a considerably improved sensitivity and specificity at their intersection threshold value. The CSSScore based ROC curves intersected at 15.5 where the sensitivity is 0.76 and specificity is 0.77. Notably, the Mathews Correlation Coefficient (MCC) at CSS-Score cutoff value of 15.5 is 0.54 which is suggestive of strong prediction model for binary classification. The calculation of sensitivity, specificity, and Mathews Correlation Coefficient is explained in supplementary information (Supplementary Note II). From ROC curves, sensitivity, specificity, and MCC values, it may be interpreted that the only sequence-based competence score (CSScore) is not very efficient in discriminating natural (NP) and nonnatural proteins (NNP-NC). However, the sequence and secondary structure-based competency score (CSS-Score) reflects a promising potential of discriminating natural (NP) and nonnatural proteins (NNPNC). The performances of CS- and CSS-Score are further evaluated on different datasets and discussed in results section. 2.11 | Competency score analysis at residue level in individual sequences The efficacy of competency scores in discriminating natural proteins (NP) and nonnatural protein (NNP-NC) sequences does not elucidate the extent of its prediction reliability. To explore this further, a residue level analysis of competency scores of individual proteins of natural and nonnatural reference datasets is performed. In case of CS-Scores, if a protein sequence is classified as natural protein (NP) on the basis of its overall competency score (CS-Score ≥ 32.15) and more than 59% of its residues are scoring above the threshold, then it is scoring better than 80% of the natural proteins in reference dataset and may be considered as natural protein with 80% confidence value. The required number of percentage residues scoring above the threshold in a protein increases to 69% for it to score better than 95% of the natural proteins in reference dataset. Likewise, if a protein is classified as nonnatural protein on the basis of competency score (CS-Score < 32.15), and more than 61% of its residues are scoring below the threshold, then it is scoring better than 80% of the nonnatural proteins and qualifies as nonnatural protein with 80% confidence value. The required number of percentage residues scoring below the threshold in a protein increases to 67% for it to be classified as nonnatural protein with 95% possibility. In case of CSS-Scores, for classifying a protein as natural protein, having scored better than 80% of natural proteins in reference dataset, it needs to have more than 62% of its residues scoring above the threshold (CSS-Score ≥ 15.50). The required percentage number of residues scoring above the threshold increases to 72% for classifying a protein as natural with score better than 95% natural proteins. For identifying a protein as nonnatural protein having outscored 80% of nonnatural proteins, 64% of its residues must be scoring below the threshold (CSS-Score < 15.50). The percentage number of residues scoring below the threshold increases to 70% for identifying a protein as nonnatural with outscoring 95% of nonnatural proteins. Figure 4A shows the distribution of percentage number of proteins in natural and nonnatural proteins reference datasets (on y-axis) against different cutoffs of percentage residues scoring below the threshold values of CSScores. Similarly, Figure 4B shows the distribution of percentage number of proteins in natural and nonnatural proteins reference datasets against different cutoffs of percentage residues scoring below the threshold values of CSS-Scores. It is worth mentioning that in case of natural proteins, the percentage of residues above threshold is considered while in case of nonnatural proteins, the percentage of residues scoring below the threshold is accounted. Thus, in case of natural protein while referring to Figure 4, the percentage number of proteins at different values of percentage number of residues scoring above the threshold can be calculated by subtracting the corresponding value from 100. Additionally, the different values of percentage residues scoring above and below the threshold for natural and nonnatural proteins in reference datasets are furnished in supplementary Table S3. It may be noted that the extent of overlap in percentage number of residues cutoffs is relatively less in case of CSS-Scores which is indicative of its better discriminating potential of natural and nonnatural proteins. The same is demonstrated on the different datasets and discussed in results section. FIGURE 3 The ROC curves for identifying natural and nonnatural proteins. (A) CS-Scores thresholdbased ROC curves for natural and nonnatural proteins. (B) CSS-Scores threshold-based ROC curves for natural and nonnatural proteins [Color figure can be viewed at wileyonlinelibrary.com] KAUSHIK AND ZHANG 1277 2.12 | Competency score based prediction of example protein For a given target protein, the sequence-based competency scores (CS-Score) for each residue (except for first and last residues) are calculated using precompiled CS-Scores libraries (explained in Section 2.4). The overall competency score is calculated from the scores of individual residues as shown in Equation (5). Based on the average CS-Score, the protein is predicted as natural (CS-Score ≥ 32.15) or nonnatural protein (CS-Score < 32.15). Also, the percentage of residues scoring below or above the threshold are calculated and employed for deriving the possibility of the predictions accuracy by comparing it with the values to the distribution of natural and nonnatural proteins. The overall flow of carrying out CS-Score based prediction of a target protein is demonstrated in Figure 5A. Further, the secondary structure prediction of a target protein (if not known) is performed using the standalone version of PSIPRED. The sequence and secondary structure information is applied for calculating sequence and secondary structure-based competency scores (CSS-Scores) for each residue (except for first and last residues) by utilizing the precompiled CSS-Scores libraries. The overall CSS-Score for the target protein is calculated and used for classifying it as natural (CSS-Score ≥ 15.50) or nonnatural (CSS-Score < 15.50). The percentage of residues scoring above or below the threshold is used for deriving the possibility of prediction accuracy via a comparison to the distribution of natural and nonnatural proteins in reference datasets. The overall flow of performing CSS-Score based prediction of a target protein is demonstrated in Figure 5B. In case of CS-Score based prediction (Figure 5A), the example target protein is identified as natural protein (CS-Score ≥ 32.15), having about 39% residues scoring below threshold (61.2% residues scoring above threshold). Referring to Table S3 (column 1 and 2, row 39), the example target protein is scoring better than 85% natural proteins. Likewise, in case of CSS-Score based prediction (Figure 5B), the example target protein is identified as natural protein (CSS-Score ≥ 15.50), having about 30% residues scoring below threshold (69.9% residues scoring above threshold). Referring to Table S3 (column 1 and 4, row 30), the example target protein is scoring better than 93% natural proteins. Since the competency score libraries and threshold values are precomputed from reference datasets of natural and nonnatural proteins, the batch calculation of CS-Score and CSS-Score is very time and computationally efficient. 3 | RESULTS AND DISCUSSION The performance of CS- and CSS-Scores is evaluated on a test dataset of natural proteins (NP) and nonnatural proteins (NNP-NC) (17 626 proteins each, as mentioned in Section 2.3). Additionally, a dataset of FIGURE 4 Distribution of natural (in green) and nonnatural (in red) proteins at different values of their percentage residues scoring below the derived cutoffs from ROC curves. (A) CSScore based distribution, highlighting percentage residues scoring below threshold for 80% of natural and nonnatural proteins. (B) CSS-Score based distribution, highlighting percentage residues scoring below threshold for 80% of natural and nonnatural proteins [Color figure can be viewed at wileyonlinelibrary.com] 1278 KAUSHIK AND ZHANG 57 000 unique natural proteins (clustered at 40% sequence identity) of sequence length varying from 50 to 700 residues from UniProtKB is selected after excluding all the natural proteins of SCOPe database (58 758 proteins). Further, two more datasets of 57 000 computationally generated proteins, one with natural AA compositions and another with uniform amino acid compositions constraint (NNP-NC and NNP-UC) are considered for quantifying the ability of competency scores in differentiating natural proteins from nonnatural proteins. 3.1 | Evaluation on natural and nonnatural proteins in test datasets The test datasets of natural proteins (NP) and nonnatural proteins (NNP-NC) are subjected to calculation of competency scores. For CSS-Score calculation, the secondary structure of individual natural protein is extracted from the corresponding structure, while the secondary structure of individual nonnatural protein (NNP-NC) is predicted using PSIPRED. The overall CS- and CSS-Scores of proteins in test datasets are calculated and further used for categorizing them into natural and nonnatural based on the threshold values (Natural Proteins ≥ CS-Score 32.15 > Nonnatural Proteins; (Natural Proteins ≥ CSS-Score 15.50 > Nonnatural Proteins). The evaluation statistics of CS- and CS-Scores are reported in Table 1. The distribution of CSand CSS-Scores for all the proteins in the Test Dataset is shown in Figure S6. Here, it may be noted that the CSS-Score based categorization of natural and nonnatural proteins outperformed CS-Score based categorization. It reflects the gain in prediction accuracy with the addition of secondary structure information. 3.2 | Evaluation on external dataset of natural and nonnatural sequences For assessing the performance of the proposed competency scores, an independent dataset of reviewed proteins from UniProtKB is extracted by filtering out all the natural proteins considered in reference and test datasets of natural proteins. The filtered reviewed proteins are further clustered to 40% sequence identity to eliminate the closely related proteins which resulted into 56 637 proteins. This dataset of unique reviewed proteins from UniProtKB is referred to as external dataset of natural proteins. For all the proteins in external dataset of natural proteins, the CS- and CSS-Scores are calculated and compared to threshold values identified in methods section, that is, Natural Proteins (CS-Score ≥ 32.15; CSS-Score ≥ 15.50). Based on CS-Score threshold, it is observed that only 33 729 (59%) proteins could be identified as natural proteins. However, CSS-Score based evaluation performed much better by identifying 45 876 (81%) proteins as natural proteins. In the nonnatural proteins dataset of 58 758 proteins, the amino acid compositions were constrained to corresponding amino acid compositions of natural proteins. Similarly, another dataset of nonnatural proteins comprising 60 000 proteins is computationally generated and clustered at 40% sequence identity to ensure the absence of similar proteins. Further, the clustered proteins are screened against the previously considered nonnatural proteins dataset to filter out the similar proteins. The clustering and filtering resulted in a new dataset of nonnatural proteins comprising 56 873 unique proteins, entirely independent of the nonnatural proteins used in deriving thresholds for CS- and CSS-Scores. This new dataset of 56 836 nonnatural proteins is referred to as external dataset of nonnatural proteins (NNPNC). The CS- and CSS-Scores for the external dataset of nonnatural proteins are calculated and classified using the previously derived thresholds for nonnatural proteins (CS-Score < 32.15; CSS-Score < 15.50). The CS-Score based identification of nonnatural proteins categorized 48 324 (85%) proteins as nonnatural while CSS-Score could identify 51 153 (90%) proteins as nonnatural. So far in this study, we have used natural proteins, adopted from SCOPe and UniProtKB databases, and nonnatural proteins, computationally generated with the same amino acid composition as the FIGURE 5 Demonstration for prediction of a target protein as natural or nonnatural protein. (A) Prediction for target protein derived from average CS-Score and the percentage of residues scoring above the threshold. (B) Prediction for a target protein derived from average CSS-Score and the percentage of residues scoring above the threshold [Color figure can be viewed at wileyonlinelibrary.com] TABLE 1 Assessment of CS- and CSS-scores on test datasets of 35 252 proteins for identifying natural and nonnatural proteins Statistics CS-score CSS-scores Sensitivity 0.62 0.76 Specificity 0.62 0.75 Accuracy 0.62 0.74 Mathews correlation coefficient 0.25 0.53 KAUSHIK AND ZHANG 1279 natural proteins in UniProt database. Further, in this study, a dataset of computationally generated 60 000 random proteins, with all amino acid residues having equal probability of occurrence, is used for evaluating the potential of the competency scores in discriminating natural proteins from randomly generated proteins. This dataset of nonnatural proteins with uniform composition of amino acid residues (NNP-UC) is clustered at 40% sequence identity which resulted in 57 374 unique nonnatural proteins. All these proteins scored within the threshold derived for nonnatural proteins (CS-Score < 32.15; CSSScore < 15.50). A distribution of CS- and CSS-Scores of the proteins in the external dataset of nonnatural proteins is shown in Figure 6. It is worth mentioning that the CSS-Score based categorization of the natural, nonnatural, and the random proteins performed consistently on a considerably higher side. Clearly, the CSS-Score emerges as a far better measure than CS-Score for identifying natural proteins from nonnatural and random proteins. A summary of performance of CS- and CSS-Score in identifying natural and nonnatural protein sequences in external dataset of proteins is provided in Table 2. For further statistical evaluation, the external datasets are combined where the natural proteins are tagged as positives, and nonnatural and random proteins are tagged as negatives. For CS-Score identification, the sensitivity, specificity, and Mathews correlation coefficient are observed to be 0.60, 0.92, and 0.57, respectively. Likewise, for CSS-Score based identification, the sensitivity, specificity, and Mathews correlation coefficient are 0.81, 0.95, and 0.79, respectively. 3.3 | Benchmarking with existing methods The CS- and CSS-Score based identification of natural and nonnatural proteins is further benchmarked with existing methods. It is worth noting that there are not many methods available for directly scoring the protein sequences to classify them as natural and nonnatural proteins. Here, we benchmarked the present scoring method with FoldIndex40 and FOLD.20,41 The FoldIndex method implements average residue hydrophobicity and net charge to derive the foldability or unfoldability of a given protein sequence where the positive score represents foldable and the negative score represents unfoldable. Some of the proteins scored very close to zero (−0.005 ≤ SCORE ≤ 0.005) which are accounted as unreliable prediction. A very recently proposed method, named FOLD, utilizes the precomputed triplet (FOLD3) and quadruplet (FOLD4) frequencies in natural and random protein sequences to classify a given protein sequence into any of the four classes, viz. sure FIGURE 6 A comparison of CS- and CSS-Scores across external datasets of natural, computationally generated nonnatural (NNP-NC and NNP-UC). A downward trend is observed from natural proteins to nonnatural proteins [Color figure can be viewed at wileyonlinelibrary.com] TABLE 2 A summary of CS- and CSS-score identification on different external datasets External dataset Total number of proteins CS-score natural CS-score nonnatural CSS-score natural CSS-score nonnatural Natural 56 637 33 729 22 908 45 876 10 761 Nonnatural (NNP-NC) 56 873 8549 48 324 5720 51 153 Nonnatural (NNP-UC) 57 374 0 57 374 0 57 374 1280 KAUSHIK AND ZHANG folded, sure random, guessed folded, and guessed random. While benchmarking our method, we combined the sure folded and guessed folded as the natural proteins, and the sure random and guessed random as the nonnatural proteins. The summary of the predictions using FoldIndex, FOLD, CS-Score, and CSS-Score for external datasets of natural, nonnatural, and random proteins is shown in Table 3 and Figure S8. For calculating sensitivity and specificity, the protein scored as unreliable prediction are not considered. Some other methods developed for characterization of protein sequences into natural and random proteins,4-6,29,42 could not be independently validated on the dataset of 170 884 proteins due to unavailability of standalone versions. For such methods, the evaluation statistics reported in respective research article is compiled and provided in Table 4. The benchmarking of CS- and CSS-Score with previously reported methods in Table 3 demonstrates a reasonably better performance in terms of sensitivity, specificity, and accuracy. Despite the fact that the accuracy of the methods accounted in Table 4 is adopted from respective research article, which is only restricted to a small dataset of natural and random proteins in most cases, the accuracy of CSSScore clearly outperformed most of these methods except Lucrezia et al4 which is validated on a dataset of 1500 small proteins of 70 TABLE 3 The summary of benchmarking of CS-score and CSS-score with FoldIndex and FOLD on the dataset of 170 884 proteins, comprising 56 637 natural and 114 247 nonnatural proteins Method Predicted natural Predicted nonnatural Unreliable prediction Sensitivity Specificity Percentage accuracy (%) FoldIndex 140 767 16 815 13 302 0.86 0.10 35 FOLD3* 63 941 105 924 1019 0.61 0.74 69 FOLD4* 49 443 107 508 13 933 0.63 0.84 77 FOLD5* 40 719 121 801 8364 0.41 0.82 69 CS-Score 42 278 128 606 0 0.60 0.92 82 CSS-Score 51 596 119 288 0 0.81 0.95 91 TABLE 4 Summary of articles reporting characterization of natural and random proteins by implementing various approaches Method/reference Parameters/approach Dataset (N + R) Accuracy (%) Remark Munteanu et al, 2008 Star network topological indices N = 1046 R = 1046 90 Bias for random Santoni et al, 2016 ML on proximity measure between pair of amino acids N = 1047 R = 10 470 75 Small dataset for natural Garbuzynskiy et al, 2004 Hydrophobicity and contact number N = 80 R = 90 83 Small dataset for natural De Lucrezia et al, 2012 Evolutionary neural network on small protein (70 aa) N = 762 R = 762 94 Only small proteins accounted Tsygvintsev, 2019 Neural network based on time series analysis N = 3502 R = 3502 85 24D vector used in complex training Present study CS-score Competency Scores derived from sequences N = 56 636 R = 114 247 82 Relatively lower accuracy Present study CSS-score Scores derived from sequences and 2 structures N = 56 636 R = 114 247 91 FIGURE 7 A boxplot representation of (A) CS- and (B) CSSScores for four classes of proteins (α, β, α/β, and α + β) KAUSHIK AND ZHANG 1281 amino acid residues length. Also, the methods reported by Munteanu et. al (26) was cross validated by Santoni et al6 to report an accuracy of 79% with a very low true positive rate. 3.4 | Distribution for different protein classes The competency scores are calculated for the unique protein sequences of all alpha (α), all beta (β), alpha and beta (α/β), and alpha plus beta (α + β) proteins representing 289, 178, 148, and 388 protein folds. The average CS-Scores are observed to be 33.9 (±3.40), 32.7 (±3.00), 34.4 (±2.67), and 33.1 (±3.06) for all alpha (α), all beta (β), alpha and beta (α/β), and alpha plus beta (α + β) proteins, respectively. Likewise, the average CSS-Scores are found to be 19.5 (±3.20), 16.5 (±4.25), 18.4 (±2.27), and 16.9 (±2.50) for all alpha (α), all beta (β), alpha and beta (α/β), and alpha plus beta (α + β) proteins, respectively. A boxplot representation of CS- and CSS-Scores is shown in Figure 7 and the additional statistics are provided in Table S4. TABLE 5 Summary of compiled designed proteins from previous research articles Research article Designed proteins Expressed in E. coli Reported soluble Monomeric proteins Solved structures Koga et al, 2012 54 49 45 19 16 Lin et al, 2015 49 49 45 31 10 Koepnick et al, 2019 144 119 99 65 55 Total 247 217 189 115 81 FIGURE 8 Competency score-based analysis of successful (green) designed and failed (red) proteins at (A) expression level, (B) solubility level, (C) oligo-state level, and (D) structural level [Color figure can be viewed at wileyonlinelibrary.com] 1282 KAUSHIK AND ZHANG It is worth noting that in case of all protein classes the average CS- and CSS-Scores are beyond the minimum threshold for natural proteins that is, CS-Score ≥ 32.15 and CSS-Score ≥ 15.50, respectively. However, a further investigation is required to find out if the scores are significantly deviating among different classes of proteins. 3.5 | Performance on reported designed proteins A set of 247 designed protein sequences, reported in some previous research articles43-45 is compiled for calculating the sequence and secondary structure-based competency scores. The experimental results of these designed proteins sequences are available for their expression, solubility, monomeric state, and structure. According to their respective articles, these sequences are selected for experimental validation after screening through some comprehensive scoring functions from more than 100 folds sampled sequences. Since only top ranked protein sequences (less than 0.1% of sampled sequences) are considered for experimental characterization, these are likely to score much higher than the expected competency scores of natural proteins. The details of the designed protein dataset are provided in Table S5 and a summary is provided in Table 5. In total, 81 designed proteins could be solved as well characterized protein tertiary structures using X-ray crystallography and/or NMR methods. The rationale of screening the designed protein sequences using the CS- and CSS-Scores is to quantify the ability of these scores at expression, solubility, oligo-state, and structural level. In Figure 8, the CS- and CSS-Scores of designed proteins accounted in Table 5 are plotted as success (green circles) and failure (red circles) cases at expression, solubility, oligo-state, and structural levels. It is observed that most of the proteins except one scored beyond the minimum threshold of natural proteins for CSS-Score (above 15.50). However, the same is not true for CS-Score as several proteins scored below the minimum threshold (below 32.15). It may also be noted that as we move from expression to solubility to oligo-state to structure, the upper-right quadrant (with CS-Score > 35 AND CSSScore > 25) of the plots remains occupied by successful cases at all four levels. This observation may help in designing novel protein sequences with a higher potential of being successful at experimental validation. 4 | CONCLUSION The infinitesimally small sequence space naturally scouted in the millions of years of evolution suggests that the natural proteins are impeded by some specific prerequisites and should diverge from computationally generated nonnatural protein sequences. Considering this, here we studied natural and computationally generated nonnatural proteins to develop a protein sequence fitness scoring function. The scoring function implements sequence and corresponding secondary structural information at tripeptide levels to differentiate natural and nonnatural proteins. The proposed scoring function is extensively validated on a dataset of about 210 000 natural and nonnatural protein sequences and benchmarked with existing methods for differentiating natural and nonnatural proteins. The high sensitivity, specificity, and percentage accuracy (0.81%, 0.95%, and 91% respectively) of the scoring function demonstrates its potential application for sampling the protein sequences with higher probability of mimicking natural proteins. Also, the four major classes of proteins (α proteins, β proteins, α/β proteins, and α + β proteins) are separately analyzed and β proteins are observed to scoring slightly on the lower side as compared to other classes. Further, an analysis of about 250 designed proteins (adopted from previously reported cases) helped in defining the boundaries for sampling the ideal protein sequences which may prove advantageous in computational protein design regimes. ACKNOWLEDGMENTS Authors are very grateful to Prof. Mihaly Mezei for help with running FOLD3 and FOLD4. We gratefully acknowledge the suggestions from Prof. Jaime Prilusky for automating the predictions using FoldIndex. We acknowledge RIKEN ACCC for the computing resource used in this study. This work is supported by Kakenhi 18H02395 from Japan Society for the Promotion of Science. CONFLICT OF INTERESTS The authors declare no conflicts of interest. DATA AVAILABILITY STATEMENT All the datasets used in the present study are provided at http:// github.com/KYZ-LSB/ComProDes. Additionally, the programs for running the proposed protein sequence fitness scoring function, user tutorial, and readme files are provided for future use of the programs. There is no additional dependency required for running the programs in Linux environment. ORCID Kam Y. J. Zhang https://orcid.org/0000-0002-9282-8045 REFERENCES 1. Kc DB. Recent advances in sequence-based protein structure prediction. Brief Bioinform. 2017;18(6):1021-1032. 2. Ovchinnikov S, Park H, Varghese N, et al. Protein structure determination using metagenome sequence data. Science. 2017;355(6322): 294-298. 3. Trainor K, Broom A, Meiering EM. Exploring the relationships between protein sequence, structure and solubility. Curr Opin Struct Biol. 2017;42:136-146. 4. De Lucrezia D, Slanzi D, Poli I, Polticelli F, Minervini G. Do natural proteins differ from random sequences polypeptides? Natural vs. random proteins classification using an evolutionary neural network. PLoS One. 2012;7(5).e36634. http://dx.doi.org/10.1371/ journal.pone.0036634. 5. Garbuzynskiy SO, Lobanov MY, Galzitskaya OV. To be folded or to be unfolded? Protein Sci. 2004;13(11):2871-2877. 6. Santoni D, Felici G, Vergni D. Natural vs. random protein sequences: discovering combinatorics properties on amino acid words. J Theor Biol. 2016;391:13-20. KAUSHIK AND ZHANG 1283 7. Turjanski P, Ferreiro DU. On the natural structure of amino acid patterns in families of protein sequences. J Phys Chem B. 2018;122(49): 11295-11301. 8. Uversky VN. What does it mean to be natively unfolded? Eur J Biochem. 2002;269(1):2-12. 9. Lu PL, Min DY, DiMaio F, et al. Accurate computational design of multipass transmembrane proteins. Science. 2018;359(6379):1042- 1046. 10. Huang PS, Boyken SE, Baker D. The coming of age of de novo protein design. Nature. 2016;537(7620):320-327. 11. Voet ARD, Noguchi H, Addy C, Zhang KYJ, Tame JRH. Biomineralization of a cadmium chloride nanocrystal by a designed symmetrical protein. Angew Chem Int Edit. 2015;54(34):9857-9860. 12. Brunette TJ, Parmeggiani F, Huang PS, et al. Exploring the repeat protein universe through computational protein design. Nature. 2015; 528(7583):580. 13. Voet ARD, Noguchi H, Addy C, et al. Computational design of a selfassembling symmetrical beta-propeller protein. Proc Natl Acad Sci U S A. 2014;111(42):15102-15107. 14. Burke AJ, Lovelock SL, Frese A, et al. Design and evolution of an enzyme with a non-canonical organocatalytic mechanism. Nature. 2019;570(7760):219. 15. Langan RA, Boyken SE, Ng AH, et al. De novo design of bioactive protein switches. Nature. 2019;572(7768):205. 16. Wang TT, Fan XT, Hou CX, Liu JQ. Design of artificial enzymes by supramolecular strategies. Curr Opin Struct Biol. 2018;51:19-27. 17. Welborn VV, Head-Gordon T. Computational design of synthetic enzymes. Chem Rev. 2019;119(11):6613-6630. 18. Leelananda SP, Jernigan RL. Diversity of sequences folding to highly and poorly designable structures. Biophys J. 2012;102(3):456. 19. Tian PF, Best RB. How many protein sequences fold to a given structure? A coevolutionary analysis. Biophys J. 2017;113(8):1719-1730. 20. Mezei M. On predicting foldability of a protein from its sequence. Proteins. 2019;88(2):355–365. 21. Laurenzi A, Hung LH, Samudrala R. Structure prediction of partiallength protein sequences. Int J Mol Sci. 2013;14(7):14892-14907. 22. LaBean TH, Butt TR, Kauffman SA, Schultes EA. Protein folding absent selection. Genes. 2011;2(3):608-626. 23. Angyan AF, Perczel A, Gaspari Z. Estimating intrinsic structural preferences of de novo emerging random-sequence proteins: is aggregation the main bottleneck? FEBS Lett. 2012;586(16):2468-2472. 24. Weiss O, Jimenez-Montano MA, Herzel H. Information content of protein sequences. J Theor Biol. 2000;206(3):379-386. 25. Pande VS, Grosberg AY, Tanaka T. Nonrandomness in protein sequences - evidence for a physically driven stage of evolution. Proc Natl Acad Sci U S A. 1994;91(26):12972-12975. 26. Mackenzie CO, Zhou JF, Zheng F, Grigoryan G. A tertiary alphabet for the observable protein structural universe captures sequencestructure relationships. Protein Sci. 2016;25:75-76. 27. Szoniec G, Ogorzalek MJ. Entropy of never born protein sequences. Springerplus. 2013;2(1):200 28. Peto M, Kloczkowski A, Honavar V, Jernigan RL. Use of machine learning algorithms to classify binary protein sequences as highlydesignable or poorly-designable. BMC Bioinform. 2008;9(1):487. http://dx.doi.org/10.1186/1471-2105-9-487. 29. Munteanu CR, Gonzalez-Diaz H, Borges F, de Magalhaes AL. Natural/random protein classification models based on star network topological indices. J Theor Biol. 2008;254(4):775-783. 30. Kabat EA, Wu TT. The influence of nearest-neighbor amino acids on the conformation of the middle amino acid in proteins: comparison of predicted and experimental determination of -sheets in concanavalin A. Proc Natl Acad Sci U S A. 1973;70(5):1473-1477. 31. Xia X, Xie Z. Protein structure, neighbor effect, and a new index of amino acid dissimilarities. Mol Biol Evol. 2002;19(1):58-67. 32. Borguesan B, Inostroza-Ponta M, Dorn M. NIAS-server: neighbors influence of amino acids and secondary structures in proteins. J Comput Biol. 2017;24(3):255-265. 33. DasGupta D, Kaushik R, Jayaram B. From Ramachandran maps to tertiary structures of proteins. J Phys Chem B. 2015;119(34):11136- 11145. 34. Chandonia JM, Fox NK, Brenner SE. SCOPe: classification of large macromolecular structures in the structural classification of proteinsextended database. Nucleic Acids Res. 2019;47(D1):D475- D481. 35. Fu LM, Niu BF, Zhu ZW, Wu ST, Li WZ. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28 (23):3150-3152. 36. Bateman A, Martin MJ, Orchard S, et al. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506-D515. 37. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16(6):276-277. 38. Heinig M, Frishman D. STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 2004;32:W500-W502. 39. Buchan DWA, Jones DT. The PSIPRED protein analysis workbench: 20 years on. Nucleic Acids Res. 2019;47(W1):W402-W407. 40. Prilusky J, Felder CE, Zeev-Ben-Mordehai T, et al. FoldIndex([c]): a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005;21(16):3435-3438. 41. Mezei M. Exploiting sparse statistics for a sequence-based prediction of the effect of mutations. Algorithms. 2019;12(10):214-220. 42. Tsygvintsev A. Natural vs. random protein sequences: the novel neural network approach based on time series analysis. Journal of Proteins and Proteomics. 2020;11(1):11–16. http://dx.doi.org/10.1007/ s42485-020-00029-8. 43. Koepnick B, Flatten J, Husain T, et al. De novo protein design by citizen scientists. Nature. 2019;570(7761):390. 44. Lin YR, Koga N, Tatsumi-Koga R, et al. Control over overall shape and size in de novo designed proteins. Proc Natl Acad Sci U S A. 2015;112 (40):E5478-E5485. 45. Koga N, Tatsumi-Koga R, Liu G, et al. Principles for designing ideal protein structures. Nature. 2012;491(7423):222-227. SUPPORTING INFORMATION Additional supporting information may be found online in the Supporting Information section at the end of this article. How to cite this article: Kaushik R, Zhang KYJ. A protein sequence fitness function for identifying natural and nonnatural proteins. Proteins. 2020;88:1271–1284. https:// doi.org/10.1002/prot.25900 1284 KAUSHIK AND ZHANG