Sequence Analysis

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Amelia Villegas-Morcillo1,*, Stavros Makrodimitris2,3,*, Roeland C.H.J. van Ham2,3, Angel M. Gomez1, Victoria Sanchez1 and Marcel J.T. Reinders2,4

1 Dept. of Signal Theory, Telematics and Communications, University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071, Granada, Spain, 2 Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE, Delft, the Netherlands, 3 Keygene N.V., Agro Business Park 90, 6708 PW, Wageningen, the Netherlands and 4 Leiden Computational Biology Center, Leiden University Medical Center, Einthovenweg 20, 2333 ZC, Leiden, the Netherlands.

*Equal contribution. †To whom correspondence should be addressed.

Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data, which are not available for this task. However, a very large amount of protein sequences without functional labels is available.

Results: We applied an existing deep sequence model that had been pre-trained in an unsupervised setting to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that three-dimensional structure is also potentially learned during the unsupervised pre-training.

Availability: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function.

Contact: ameliavm@ugr.es

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Proteins perform most of the functions necessary for life. However, proteins with a well-characterized function are only a small fraction of all known proteins and are mostly restricted to a few model species. Therefore, the ability to accurately predict protein function has the potential to accelerate research in fields such as animal and plant breeding, biotechnology, and human health.

The most common data type used for automated function prediction (AFP) is the amino acid sequence, as conserved sequence implies conserved function (Kimura and Ohta, 1974). Consequently, many widely used AFP algorithms rely on sequence similarity via BLAST (Altschul et al., 1990) and its variants, or on hidden Markov models (Eddy, 2009). Other types of sequence information that have been used include k-mer counts, predicted secondary structure, sequence motifs, conjoint triad features and pseudo-amino acid composition (Cozzetto et al., 2016; Fa et al., 2018; Sureyya Rifaioglu et al., 2019). Moreover, Cozzetto et al. showed that different sequence features are informative for different functions.
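To make one of these hand-crafted representations concrete, the short Python sketch below computes a fixed-length k-mer count vector from a protein sequence. This is our own illustration, not code from the paper; the 20-letter amino acid alphabet and the function name are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter amino acid alphabet

def kmer_counts(sequence, k=2):
    """Return a length-20**k vector of k-mer counts for a protein sequence."""
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [observed["".join(kmer)] for kmer in product(AMINO_ACIDS, repeat=k)]

# For k = 2 this gives a 400-dimensional dipeptide count vector,
# independent of the protein's length.
features = kmer_counts("ERQFFRDSDTPYESFLYKAAP", k=2)
```

Because the vector length depends only on k, proteins of different lengths are mapped to the same feature space.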
More recently, advances in machine learning have partially shifted the focus from hand-crafted features, such as those described above, to representations learned automatically by deep neural networks. Supervised deep models, however, require large amounts of labeled training data, which are scarce for AFP, whereas unlabeled protein sequences are available in abundance (~175M in UniProtKB). Although these sequences cannot be directly used to train an AFP model, they can be fed into an unsupervised deep model that tries to learn general amino acid and/or protein features. This learned representation can then be applied to other protein-related tasks, including AFP, either directly or after fine-tuning by means of supervised training. Several examples of unsupervised pre-training leading to substantial performance improvements exist in the fields of computer vision (Doersch et al., 2015; Gidaris et al., 2018; Mathis et al., 2019) and natural language processing (NLP) (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018). In bioinformatics, pre-training was shown to be beneficial for several deep neural network architectures on protein engineering and remote homology detection tasks (Rao et al., 2019).

[Fig. 1. Overview of the protein representations used. (a) From the protein sequence of length L, amino acid-level features (ELMo embeddings, 1024; one-hot encodings, 26) are averaged into protein-level features, alongside k-mer counts. (b) From the protein 3D structure, structural features include a DeepFold fingerprint (398) and the L×L contact map, treated either as an image or as a graph adjacency matrix.]

A deep unsupervised model of protein sequences was recently made available (Heinzinger et al., 2019). It is based on the NLP model ELMo (Embeddings from Language Models) (Peters et al., 2018) and is composed of a character-level CNN (CharCNN) followed by two layers of bidirectional LSTMs. The CNN embeds each amino acid into a latent space, while the LSTMs use that embedding to model the context of the surrounding amino acids. The hidden states of the two LSTM layers and the latent representation are added to give the final context-aware embedding. These embeddings demonstrated competitive performance in both amino acid and protein classification tasks, such as inferring protein secondary structure, structural class, disordered regions, and cellular localization (Heinzinger et al., 2019; Kane et al., 2019). Other works also trained LSTMs to predict the next amino acid in a protein sequence, using the LSTM hidden state at each amino acid as a feature vector (Gligorijevic et al., 2019; Alley et al., 2019). Finally, a transformer neural network was trained on 250 million protein sequences, yielding embeddings that reflected both protein structure and function (Rives et al., 2019).

Protein function is encoded in the amino acid sequence, but sequences can diverge during evolution while maintaining the same function. Protein structure is also known to determine function and is, in principle, more conserved than sequence (Wilson et al., 2000; Weinhold et al., 2008). From an AFP viewpoint, two proteins with different sequences can be assigned with high confidence to the same function if their structures are similar.
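As an illustration of how the context-aware ELMo embeddings described above can be extracted and pooled into a single protein-level vector, the following sketch uses the AllenNLP ElmoEmbedder in the way the SeqVec release of Heinzinger et al. (2019) documents; the model directory is a placeholder and we assume the pre-trained weight files have been downloaded.

```python
from pathlib import Path

import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

# Placeholder path to the downloaded pre-trained SeqVec (UniRef50) model files
model_dir = Path("path/to/seqvec/uniref50_v2")
embedder = ElmoEmbedder(
    options_file=str(model_dir / "options.json"),
    weight_file=str(model_dir / "weights.hdf5"),
    cuda_device=-1,  # -1 = CPU; set to a GPU id if available
)

sequence = "ERQFFRDSDTPYESFLYKAAP"
# embed_sentence returns an array of shape (3, L, 1024):
# the CharCNN output plus the hidden states of the two bidirectional LSTM layers
layers = embedder.embed_sentence(list(sequence))

residue_embeddings = np.asarray(layers).sum(axis=0)   # (L, 1024) context-aware, per residue
protein_embedding = residue_embeddings.mean(axis=0)   # (1024,) protein-level vector
```

Summing the three layers corresponds to the addition of the CharCNN output and the two LSTM hidden states described above; averaging over residues then gives a 1024-dimensional protein representation that is independent of sequence length.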
Because structure is more conserved than sequence, it is generally thought that combining sequence data with 3D structure leads to more accurate function predictions for proteins with known structure, especially for those without close homologues. Structural information is often encoded as a protein distance map. This is a symmetric matrix containing the Euclidean distances between all pairs of residues within a protein and is invariant to translations and rotations of the molecule in 3D space. A binary representation, called a protein contact map, can be obtained from this real-valued matrix by applying a distance threshold (typically between 5 and 20 Å). This two-dimensional representation successfully captures the overall protein structure (Bartoli et al., 2007; Duarte et al., 2010). The protein contact map can be viewed as a binary image, where each pixel indicates whether a specific pair of residues is in contact or not. Alternatively, it can be interpreted as the adjacency matrix of a graph, where each amino acid is a node and edges connect amino acids that are in contact with each other (a minimal construction is sketched below). In order to extract meaningful information from contact maps, both two-dimensional CNNs (Zhu et al., 2017; Zheng et al., 2019) and graph convolutional networks (GCNs) (Fout et al., 2017; Zamora-Resendiz and Crivelli, 2019) have been proposed.

Only Gligorijevic et al. (2019) have explored the effectiveness of a pre-trained sequence model in AFP, but they did so in combination with protein structure information using a GCN. We suspect that a deep pre-trained embedding can be powerful enough to predict protein function on its own, in which case the structural information would not offer any significant performance improvement. Therefore, we set out to evaluate pre-trained ELMo embeddings in the task of predicting molecular functions, by comparing them to hand-crafted sequence and structural features, in combination with 3D structure information in various forms. We focus on the Molecular Function Ontology (MFO), as it is the ontology most correlated with sequence and structure (Anfinsen, 1973), but we also perform small-scale experiments on the Biological Process Ontology (BPO) and Cellular Component Ontology (CCO). Fig. 1 provides an overview of the data and models used in our experiments. We demonstrate the effectiveness of the ELMo model (Heinzinger et al., 2019) and show that protein structure does not provide a significant performance boost to these embeddings, although it does so when we only consider a simple protein representation based on one-hot encoded amino acids.

2 Materials & Methods

2.1 Protein representations

We considered two types of representations of the proteins (Fig. 1). The first describes the sequence using amino acid features, and the second the three-dimensional structure, in the form of distance maps. For each sequence of length L, we extracted amino acid-level features using a pre-trained unsupervised language model (Heinzinger et al., 2019). This model is based on ELMo (Peters et al., 2018) and outputs a feature vector of dimension 1024 for each amino acid.
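The distance-map and contact-map construction referred to above can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming per-residue Cα coordinates are available as an (L, 3) array (e.g. parsed from a PDB file); the 10 Å threshold is an example value within the 5–20 Å range mentioned in the Introduction, not necessarily the one used in our experiments.

```python
import numpy as np

def distance_and_contact_map(ca_coords: np.ndarray, threshold: float = 10.0):
    """Compute an L x L distance map and a binary contact map.

    ca_coords: array of shape (L, 3), one C-alpha coordinate per residue
    threshold: contact cutoff in Angstrom (illustrative value)
    """
    # Pairwise Euclidean distances; invariant to rotation/translation of the molecule
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]        # (L, L, 3)
    distance_map = np.sqrt((diff ** 2).sum(axis=-1))            # (L, L), symmetric
    # Thresholding yields a binary matrix usable as an image or as a graph adjacency matrix
    contact_map = (distance_map <= threshold).astype(np.uint8)  # (L, L), entries in {0, 1}
    return distance_map, contact_map
```

Treating contact_map as a single-channel image is the natural input for a 2D CNN, whereas treating it as a graph adjacency matrix (typically with self-loops added) is the usual starting point for a GCN.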