RESEARCH ARTICLE
Learning a functional grammar of protein domains using
natural language word embedding techniques
Daniel W. A. Buchan | David T. Jones
Department of Computer Science, University
College London, London, UK
Correspondence
David T. Jones, Department of Computer
Science, University College London, Gower
Street, London WC1E 6BT, UK.
Email: d.t.jones@ucl.ac.uk
Funding information
Biotechnology and Biological Sciences
Research Council, Grant/Award Number: BB/
M011712/1
Peer Review
The peer review history for this article is
available at https://publons.com/publon/10.1002/prot.25842.
Abstract
In this paper, using Word2vec, a widely-used natural language processing method,
we demonstrate that protein domains may have a learnable implicit semantic “meaning”
in the context of their functional contributions to the multi-domain proteins in
which they are found. Word2vec is a group of models which can be used to produce
semantically meaningful embeddings of words or tokens in a fixed-dimension vector
space. In this work, we treat multi-domain proteins as “sentences” where domain
identifiers are tokens which may be considered as “words.” Using all InterPro (Finn
et al. 2017) Pfam domain assignments, we observe that the embedding could be used
to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown
function.
KEYWORDS
function prediction, machine learning, protein domains, semantic embedding, word2vec
1 | INTRODUCTION
Word2vec1
is a group of models which can be used to learn the
embeddings of words in a continuous fixed-dimension vector space,
given a corpus of sentences as training data. Often Natural Language
Processing (NLP) tasks consider words as sets of unrelated tokens,
subjecting them to no more rigorous analysis than simple frequency
counting. While this is mathematically and computationally convenient,
it ignores the fact that most words have degrees of similarity,
such as verbs with differing tenses, adverbs with differing endings or
words which share the same suffixes. Word2vec aims to produce
embeddings of words in a vector space where distance in the vector
space correctly encodes the degree to which words or terms are similar
or can be used in similar semantic context. Although a great deal
has been written about these methods, it remains unclear exactly why
these models are so performant.2
Nevertheless, they do show good
performance in the task of clustering words with related semantic
meaning, and interested readers should consult the original paper for
further details of the model.1
Since lexical word embeddings have
become popular, they have been adapted and applied directly to protein
and gene sequences. prot2vec, gene2vec, and seq2vec are examples
of such methods.3,4
Another prior application of Word2vec is the
work of Viehweger,5
applying protein domain embeddings as a
method to encode whole genomes.
Proteins are often composed of discrete domains, and these can
either be conceptualized as sub-sequences of independent protein
sequences which share homology (and by extension evolutionary
origin),6
or alternatively, domains may be considered structurally,
where they are subsections of the proteins which are compact, independently
folding and observed to be shared between a variety of
proteins.7-9
An extension of this observation, that proteins can be
decomposed into sets of domains, is the hypothesis that domains act
as sub-functional units and when composed together, a protein's
given combination of domains is what gives rise to the protein's overall
specific function.10,11
In the following study, we show that protein
domains can be embedded in a “semantically” meaningful vector space
and that this embedding space reflects meaningful information about
the functional roles (in terms of GO term assignments) of the individual
protein domains.
Protein function prediction has received a great deal of attention
in the preceding 20 years12
and a great number of function prediction
methods have been developed. Many of these make use of sequence
comparison and some manner of nearest neighbor functional assignment.13,14
As the field has progressed, work has been carried out to
integrate more sophisticated statistical methods and models with
Received: 10 July 2019 Revised: 8 October 2019 Accepted: 3 November 2019
DOI: 10.1002/prot.25842
616 © 2019 Wiley Periodicals, Inc. Proteins. 2020;88:616–624.wileyonlinelibrary.com/journal/prot
many contemporary methods leveraging machine learning with
ensemble or meta-prediction methodologies. The current state of the art
in protein function prediction is measured by the Critical Assessment of Functional
Annotation (CAFA) community experiment.15
In this experiment, groups attempt to predict experimentally validated Gene Ontology
(GO) terms16
over a blind set of unannotated protein sequences. The
most successful methods in CAFA employ a wide variety of predictive
methodologies. Among the most common are methods which integrate
data and annotations from a wide variety of sources including
BLAST searches, protein–protein interaction networks, multiple
sequence alignment analysis, sequence analysis, expression data, and
many more.17-20
A number of other successful methodologies eschew
integrating heterogeneous data sources and make use of more focused
analyses, such as phylogenetic analysis,21 literature analysis,22 MSA
analysis,23 and domain function analysis.24,25 Information about protein
domains is typically only included indirectly, such as in the
methods INGA and PFPDB which make use of Pfam to derive phylogenetic
or domain architecture patterns. Less common are methods
which directly attempt to annotate domains with function and then
leverage this information for function prediction. The SIFTER,
CATH-Funfam,24 and Superfamily-dcGO25 methods in CAFA were all
successful methods which directly leverage such domain function
annotations. It is clear that understanding the relationship between
protein domains and their function can make a significant contribution
to accurate function prediction. Nevertheless, even with the wide
range of prediction methodologies, performance and progress in the
CAFA experiment indicate that protein function prediction remains a
challenging problem in the field of bioinformatics.
In the following work, we discuss the use of Word2vec in protein
domain embedding. We prepare such a domain embedding and
attempt to explore its properties to discern whether such embeddings
encode biological information that may be useful in either a predictive
or analytic context. Such embeddings may be useful adjuncts
or input features in protein function prediction, as they may give a
homology-free way to characterize and functionally cluster protein
domains. At the end of the paper we note that such an embedding
could be used for the purposes of homology-free GO term inheritance
and we show a naïve application of this for Pfam Domains of
Unknown Function.
2 | METHOD
2.1 | Datasets
InterPro 6226
was downloaded along with the associated GO and protein
domain assignments. The files were parsed to extract only the
eukaryotic proteins and their GO and Pfam domain assignments. This
work looks only at eukaryotic proteins as there are few examples of
proteins with multiple domains with independent evolutionary histories
in the bacterial and archaeal kingdoms; as such, little domain context
information would be available for proteins from those kingdoms.
Only GO assignments with the following evidence codes were
retained: EXP, IBA, IDA, IEP, IGC, IGI, IMP, and IPI. These are
(respectively): Inferred from Experiment, Inferred from Biological aspect
of Ancestor, Inferred from Direct Assay, Inferred from Expression
Pattern, Inferred from Genomic Context, Inferred from Genetic
Interaction, Inferred from Mutant Phenotype, and Inferred from Physical
Interaction. This eliminates all the high throughput and more tenuous
computational annotation assignments. The resulting dataset
contains 9 030 650 eukaryotic proteins, which have domain assignments
over 11 355 of the available Pfam domain families and these
proteins are associated with annotations from 2358 GO Terms.
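As a minimal sketch of the evidence-code filter described above, the following assumes annotations are available as (protein, GO term, evidence code) tuples. The tuple layout, the function name `filter_annotations` and the example records are illustrative assumptions, not the parsing code used in this study.

```python
# Evidence codes retained in the text above.
RETAINED_EVIDENCE = {"EXP", "IBA", "IDA", "IEP", "IGC", "IGI", "IMP", "IPI"}


def filter_annotations(annotations):
    """Keep only (protein, go_term, evidence) tuples with a retained code."""
    return [a for a in annotations if a[2] in RETAINED_EVIDENCE]


# Hypothetical records: accessions and pairings are invented for illustration.
example = [
    ("protA", "GO:0005615", "IDA"),  # direct assay: kept
    ("protA", "GO:0005622", "IEA"),  # electronic annotation: dropped
    ("protB", "GO:0005634", "IMP"),  # mutant phenotype: kept
]
kept = filter_annotations(example)
```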
Not all regions within each protein have been assigned to domains
(see Table 1). This is in large part because not all domains are known and
assigned, but also because many eukaryotic proteins possess regions
of intrinsic disorder,27
regions of low complexity or coiled-coil
sequences. All such unassigned regions were compiled (see below). As
Word2vec analyses words based on the semantic context of neighboring
words, tokens representing unassigned regions in our corpus could
contain important domain context information, and so we wished to
preserve this.
These data were then used to derive which Pfam domains are
seen to be associated to which GO terms. For every Pfam domain, we
associated all GO terms assigned to all the proteins the Pfam domain
was observed in. This assigns a varied bag of GO terms to each Pfam
domain and this bag of terms can be viewed as representing the spectrum
of observed functional diversity for that Pfam domain.
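The construction of each domain's bag of GO terms can be sketched as a union over the proteins in which the domain is observed. The dictionary layout, the name `build_go_bags` and all identifiers in the example are hypothetical placeholders.

```python
from collections import defaultdict


def build_go_bags(protein_domains, protein_go):
    """Map each Pfam domain to the union of GO terms of its proteins."""
    bags = defaultdict(set)
    for protein, domains in protein_domains.items():
        terms = protein_go.get(protein, set())
        for domain in domains:
            bags[domain] |= terms
    return dict(bags)


# Hypothetical example data (not taken from InterPro).
protein_domains = {"protA": ["PF00001", "PF00002"], "protB": ["PF00002"]}
protein_go = {"protA": {"GO:0000001"}, "protB": {"GO:0000002"}}
go_bags = build_go_bags(protein_domains, protein_go)
```

Here the domain shared by both proteins collects the terms of both, which is how a single domain can accumulate a functionally diverse bag.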
2.2 | Unassigned sequence region assignments
The sequence database for InterPro 62 was masked for both coiled
coil and low complexity regions using pfilt.28
Disordered regions were
derived directly from the existing InterPro disorder annotations. Gap
regions which did not contain disorder annotations, coiled-coil or low
complexity sequence were assigned according to the length of the
unassigned region. These remaining gap regions were binned by
length (see Figure 1). The majority of gap
regions are around 100 residues in length. As the typical structural
domain size is also around 100 residues, five gap types were created to
represent unassigned regions of various sizes which are approximate
multiples of the typical domain size (see Table 2). All non-domain
regions (gaps, disordered, low complexity, and coiled-coil regions) were
then compiled as a set of adjunct domain-like sequence regions to
complement the Pfam domain assignments.

TABLE 1 Total residue counts across the eukaryotic InterPro protein set and the number of residues assigned to each class of domain or region

Class             Residue count     Percentage
Total             5 001 517 961     -
Domains           1 256 832 058     25.1
Gaps              3 405 089 896     68.1
Disordered          167 103 753     3.3
Coiled coil           3 309 167     0.06
Low complexity        2 079 334     0.04

BUCHAN AND JONES 617
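The binning of gap regions can be sketched as follows, using the size ranges given in Table 2. The function name `gap_token` is hypothetical, and skipping regions shorter than 20 residues is an assumption, since the text does not state how such regions were handled.

```python
def gap_token(length):
    """Return the gap pseudo-domain ID for an unassigned region, or None."""
    if length < 20:
        return None  # assumption: regions below the smallest bin are skipped
    if length > 400:
        return "G500"  # the last bin is open-ended
    # Round up to the next multiple of 100: 20-100 -> G100, 101-200 -> G200, ...
    upper = ((length + 99) // 100) * 100
    return f"G{upper}"
```

For example, a 150-residue unassigned region maps to the token G200 under this sketch.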
2.3 | Building the word embedding
To build Word2vec embeddings, we treat protein sequences and their
domain assignments as “sentences.” The Pfam IDs and other sequence
region assignments are used as tokens/pseudo-words in such a pseudosentence.
For instance, a typical protein may be converted to a sentence
such as “PF00170 PF003534 G200 LowComplexity PF00678,”
which would indicate two leading Pfam domains followed by a gap
region of up to 200 residues and a region of low complexity sequence, finally
terminating in a Pfam domain (see Figure 2). We compile such sentences
for every eukaryotic protein in InterPro 62 and this set of sentences
becomes the corpus we use to create the word embedding.
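A minimal sketch of turning one protein's region assignments into such a sentence is given below. It assumes each region is available as a (start position, token) pair; that layout, the name `protein_to_sentence` and the start positions in the example are invented for illustration.

```python
def protein_to_sentence(regions):
    """Order region assignments by start position and join their tokens."""
    ordered = sorted(regions, key=lambda region: region[0])
    return " ".join(token for _start, token in ordered)


# Tokens follow the example sentence in the text; positions are hypothetical.
regions = [
    (310, "LowComplexity"),
    (1, "PF00170"),
    (95, "PF003534"),
    (180, "G200"),
    (340, "PF00678"),
]
sentence = protein_to_sentence(regions)
```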
The Python library gensim (https://radimrehurek.com/gensim/) was
used to create the Word2vec model from the corpus. The size parameter
was set to 100, representing the dimensionality of the vector
space into which the words are projected. The minimum word count was set to
0, indicating that all words would be positioned in the vector space.
This ensures that all domains, including important infrequent ones, are
considered. The embedding was built using the skip-gram algorithm and
model. The goal of Word2vec is to learn the
weights in the hidden layer of a simple neural network. This hidden
layer is an n by m matrix, where n is the number of distinct input words in the
corpus and m is the size parameter (eg, 100). To train these weights
the network is given a training task, the skip-gram task, which asks
the network, for each word in turn, to output the probability
that other words from the corpus are near to the input word (ie,
within a given window size, in this instance a window of 5). Once the
training is complete the output probabilities are discarded and only
the weights of the hidden layer are retained as this matrix is regarded
as the word embedding. It is possible to develop alternative training
tasks to learn the embedding matrix. A target behavior of Word2vec
is that words which fulfill similar semantic roles should be near one
another in the embedding, and it is believed that the skip-gram task,
by having the network learn which words are local to one
another, in turn encodes this information in the weights of the hidden
layer.

FIGURE 1 Distribution of gap regions (regions without Pfam domain assignments) in InterPro eukaryotic sequences. Horizontal axis: gap lengths (100 to 1400 residues); vertical axis: number of gaps observed (0 to 6e+07)

TABLE 2 Names and sizes of gap pseudo-domains and the number of InterPro proteins where we observe at least one of these regions

Gap region ID     Size (residues)     Protein count
G100              20–100              4 234 931
G200              101–200             2 635 225
G300              201–300             1 168 553
G400              301–400               575 517
G500              401 to >500           926 673

FIGURE 2 An example of the domain and sequence region
assignment. Pfam domains and disorder regions are derived from
InterPro annotations. Low complexity and coiled coil regions are
calculated by pfilt and gaps are assigned given their size [Color figure
can be viewed at wileyonlinelibrary.com]
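To make the skip-gram training task concrete, the sketch below enumerates the (input word, nearby word) pairs that a window defines for one domain sentence. It illustrates the task only and is not the gensim implementation; real implementations add refinements that are omitted here, and the name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(sentence, window=5):
    """Yield (input_token, context_token) pairs within the window."""
    tokens = sentence.split()
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                yield centre, tokens[j]


example = "PF00170 PF003534 G200 LowComplexity PF00678"
pairs = list(skipgram_pairs(example, window=1))
```

With a window of 5, as used in this work, every token in a five-token sentence is paired with every other token, which reflects how short multi-domain sentences are relative to the window.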
The embedding process is illustrated in full in Figure 3. For the benchmark
below, an all-against-all distance matrix of domains was derived.
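An all-against-all distance matrix over an embedding can be sketched as below. The distance metric used for the Word2vec embedding is not stated in the text, so Euclidean distance is an assumption here; the function names and the toy two-dimensional vectors are also hypothetical.

```python
import math


def distance_matrix(embedding):
    """Return nested dicts of pairwise Euclidean distances between tokens."""
    tokens = sorted(embedding)
    return {
        a: {b: math.dist(embedding[a], embedding[b]) for b in tokens}
        for a in tokens
    }


def nearest_neighbours(matrix, token, k=1):
    """Return the k tokens closest to `token`, excluding the token itself."""
    others = [(d, t) for t, d in matrix[token].items() if t != token]
    return [t for _d, t in sorted(others)[:k]]


# Toy embedding with invented identifiers and coordinates.
toy = {"PF00001": (0.0, 0.0), "PF00002": (1.0, 0.0), "PF00003": (0.0, 3.0)}
dm = distance_matrix(toy)
```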
2.4 | Benchmark
We are interested in whether Word2vec embeds Pfam domains in a
manner which is biologically meaningful. This in turn would indicate
that there is some manner of semantic meaning in the positioning
or sequence context for protein domains. To investigate the embedding,
initially we attempted to project the domain vectors into three dimensions
(data not shown) using multi-dimensional scaling. However, the
resulting projection did not yield any trivially interpretable result.
An alternative means of investigating whether the embedding is
biologically meaningful would be to establish if functionally related
domains are placed near one another in the embedding. To investigate
this, we assigned GO terms to the Pfam domains. This was carried
out by allowing Pfam domains to inherit all GO terms assigned
to the proteins each Pfam domain is observed in. Pfam domains
inherit an average of 19.6 GO terms, although some domains may
have upwards of 100 terms associated, see Figure 4. Although this is
somewhat imprecise, as GO annotations reflect protein functions
rather than domain function, each domain's “bag” of GO terms will
reflect the functional diversity for the contexts a domain is observed
in. A total of 2358 GO terms were assigned over the 11 355 Pfam
domains observed in the eukaryotic proteins. These assignments
could then be used for a nearest-neighbor benchmark test.
3 | RESULTS
3.1 | Nearest-neighbor performance
Performance in nearest-neighbor functional annotation was calculated to
assess whether the vector embedding of domains displayed any meaningful
structure, that is, whether domains with similar functionality were placed
near one another in the embedding. Each domain was considered in turn
by inheriting the GO terms from its k nearest neighbors and comparing
these predicted terms to the known terms assigned via InterPro annotations.
Table 3 gives the precision, recall, accuracy and Matthews correlation coefficient
(MCC) scores for the nearest-neighbor benchmark. The MCC values indicate that
the predicted terms are non-random (greater than 0), which in turn
suggests that there is some meaningful structure in the embedding of
domains in a vector space. Mean accuracy is high, and this is a consequence
of there being a very large number of GO terms, of which typically
only relatively few are used to annotate any given protein or domain.
This in turn means any given domain has a very large number of true negatives,
most of which are called correctly. As k is increased, recall also
increases as the bag of assigned terms gets very large, but this comes at
the cost of a sharply declining precision.
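The per-domain scoring can be sketched as a comparison of two sets of GO terms against the full set of terms in use. This is a sketch under stated assumptions rather than the benchmark code: the function name, the zero-division conventions and the example term IDs are illustrative, while the universe size of 2358 terms is taken from the text.

```python
import math


def score_prediction(predicted, known, n_terms):
    """Score inherited GO terms against known terms over n_terms in total."""
    tp = len(predicted & known)
    fp = len(predicted - known)
    fn = len(known - predicted)
    tn = n_terms - tp - fp - fn  # every other term is a true negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / n_terms
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, accuracy, mcc


# Hypothetical term sets for one domain.
predicted = {"GO:0000001", "GO:0000002", "GO:0000003"}
known = {"GO:0000001", "GO:0000004"}
precision, recall, accuracy, mcc = score_prediction(predicted, known, 2358)
```

In this toy case only one of three predicted terms is correct, yet accuracy remains above 0.99 because almost all of the 2358 terms are true negatives, which is why the MCC is the more informative figure.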
FIGURE 3 Compiling protein “sentences.” InterPro compiles assignments of domains on UniProt protein sequences. We take only the Pfam
domain assignments that the InterPro database stores and complement those with the assignments of disorder and our own low complexity (LC) and
coiled-coil (CC) region assignments. These are then tokenized to create a corpus of “sentences.” The corpus can then be used as input to
Word2vec. The output is a vector space which places each token at a point within that space, here stylized in two dimensions. Tokens which
appear in similar syntactic contexts in the corpus should be placed near one another in the vector space
Word2vec is designed to embed human language words in a vector
space such that words which occur in similar semantic contexts
are close to one another in the vector space. That our domain embedding
is non-random implies that multidomain proteins exhibit some
form of semantic structure. That is, certain domains appear in contexts
near or adjacent to other domains and it may be possible to
learn grammar-like rules which govern this.
It is worth noting that increasing the number of neighbors (increasing
k) from which functional roles can be inherited degrades performance in
this function-annotation task. Domains are typically involved in a large
number of possible different protein functions. By increasing the number
of neighbors from which GO terms can be inherited, the number of false positives
is greatly increased and so performance degrades.
3.2 | Per ontology results
MCC values were also calculated for each of the three GO ontologies
(see Table 4). Of the 2358 GO terms used to annotate eukaryotic
sequences in InterPro, 1018 are from the molecular function ontology,
1026 are from the biological process ontology and 314 are from the
cellular component ontology. The MCC values indicate different functional
inheritance performance for each ontology, with the cellular component
ontology scoring highest at k = 1. In the context
of the vector embedding this may imply that the simple syntax contained
in the domain orderings contains some additional information
about where a protein is located within the cell. Given the results of
the previous CAFA experiment15
it may, more simply, be that cellular
component prediction is an easier task.
In general, we believe the MCC calculated may underestimate the
quality of the domain embedding. Given the figures in Table 1, we see
that nearly 70% of the residues in these proteins lie in gap regions. This indicates many
domain assignments and domain types may be missing. We would
expect that with better domain coverage we would also have a more
robust and biologically meaningful embedding.
Alongside this, using GO assignments to genes to annotate domains
is inherently noisy. GO annotations may not be good descriptors of the
specific role a domain plays in a given protein. For instance,
GO:0051987 (Chaperone Binding), assigned to 92 Pfam domains, might
be regarded as a property or function of a whole protein rather than just a
specific domain. An alternative issue is illustrated by Pfam domain
PF00176, which is assigned both GO:0009916 (alternative oxidase activity)
and GO:0001733 (galactosylceramide sulfotransferase activity).
These assignments come through differing InterPro proteins but represent
different catalytic reaction chemistries, and this domain is unlikely to
possess both of these. Within the context of multidomain proteins,
domains provide specific sub-functionality such as providing catalytic
sites, presenting one or more small molecule binding sites, providing
membrane anchoring and so forth. It seems plausible that if domains were
annotated at a level that better reflected these more specific sub-functional
roles (rather than the protein's role), then the nearest-neighbor
assignment would return better results. The lack of a computer readable
“domain ontology” remains a barrier for large scale studies of domain
functionality and evolution.
FIGURE 4 Distribution of Gene Ontology term assignments. Horizontal axis: number of assigned GO terms (0 to 200); vertical axis: number of Pfam domains (0 to 3000)

TABLE 3 Mean precision, recall, accuracy and Matthews correlation coefficient (MCC) given nearest-neighbor inheritance of Gene Ontology terms

k-Nearest neighbors     Mean precision     Mean recall     Mean accuracy     Mean MCC
1                       0.33               0.30            0.99              0.28
3                       0.23               0.42            0.98              0.28
5                       0.18               0.49            0.98              0.26
10                      0.12               0.57            0.96              0.23

TABLE 4 Matthews correlation coefficient (MCC) values for nearest-neighbor inheritance of Gene Ontology (GO) terms, calculated for each separate GO ontology

Ontology                k = 1     k = 3     k = 5     k = 10
Biological process      0.27      0.20      0.19      0.17
Molecular function      0.30      0.23      0.22      0.19
Cellular component      0.33      0.22      0.22      0.20

3.3 | Comparison to first-order Markov representation
As sets of domains are sequences of symbols or states, it is possible to
represent the information contained in the corpus of domain strings as a
Markov process. We also investigated whether the Word2vec domain
embedding was a more robust representation of the information contained
in the domain corpus than a first-order Markov process. Parsing
the corpus of proteins, a table of the transition probabilities of all
domains against all domains was prepared. A given domain's immediate
context can be read from the table, as the rows give the probabilities of
the following domain and columns indicate the probabilities of preceding
domains. It follows that pairs of domains which share both similar row
and column vectors are used in the same context in multidomain proteins.
A distance matrix of Euclidean distances between all domains' vectors
was prepared and the nearest-neighbor assignment analysis
described above was performed; the results can be seen in Table 5.
These results indicate that the Word2vec domain embedding is substantially
better at encoding the biological information contained in the corpus
of multidomain proteins. The comparison may not be completely
equivalent: Markov probabilities take into account only the preceding
symbol (or symbols in higher order chains), whereas the Word2vec
method considers a window of tokens around each domain, and this feature
is likely a better match for modeling protein domain placement.
Considering the incoming and outgoing probabilities for each domain
could be considered equivalent to considering a window of three
domains. The default window size for Word2vec is 5. This comparison
may therefore under-report the ability of a Markov process to model these
data. However, the corpus of multidomain proteins only contains a tiny
fraction of the possible 3- and 5-mers of domains, and with many
unassigned regions, getting accurate probabilities may not be possible.
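The first-order Markov representation can be sketched as a table of transition probabilities estimated from adjacent token pairs. Concatenating a domain's row and column into one vector is an assumption about how the "row and column vectors" were combined, and the function names and example corpus are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(corpus):
    """Estimate P(next token | current token) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    probs = {}
    for current, following in counts.items():
        total = sum(following.values())
        probs[current] = {t: n / total for t, n in following.items()}
    return probs


def context_vector(probs, token, vocabulary):
    """Concatenate the token's row (outgoing) and column (incoming) entries."""
    row = [probs.get(token, {}).get(other, 0.0) for other in vocabulary]
    column = [probs.get(other, {}).get(token, 0.0) for other in vocabulary]
    return row + column


# Invented two-sentence corpus.
corpus = ["PF00001 PF00002 PF00003", "PF00001 PF00003"]
probs = transition_probabilities(corpus)
vocabulary = sorted({t for s in corpus for t in s.split()})
vector = context_vector(probs, "PF00003", vocabulary)
```

The resulting vectors could then be compared with Euclidean distance, as described for the Markov analysis above.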
3.4 | Vector arithmetic on the domain embeddings
One observation of semantic embeddings of natural languages is that
arithmetic operations on the vectors frequently have semantic or lexical
meanings, one classic example being:
King – Man + Woman = Queen.
We wished to investigate if simple vector arithmetic or translations
for the protein domain embedding might have similar lexical meaning.
In the King to Queen example (see Figure 5), subtracting Man
from King takes you to a space in the embedding with the meaning of
man “removed” such that adding the Woman vector will take you to
Queen. We can perform similar vector subtractions for the domain
embedding. In this context, we would treat a domain's set of GO
terms as equivalent to its “meaning,” although, as discussed, this is a
lossy way to conceptualize the meaning of a domain. Nevertheless, if
we subtract one domain vector from another, we would hope the resulting
third vector lies in a region of the space where the associated set of GO
terms is the set difference of the two domains.
We took the 20 most common Pfam domains, removing the one
that is not present in eukaryotes, and for each in turn subtracted all possible
domain vectors. For the resulting third vector, we found the nearest
domain and tested the GO term overlaps with the initial two domains.
In nearly all cases the resulting domain has minimal GO term overlap
with its parents. It is clear that this operation moves us to a region in
the vector space where the domains' “meaning” is profoundly altered,
much as removing Man from King might be thought of as moving to a
gender-neutral space. What is not clear is the functional
meaning of this in protein domain terms.
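The subtraction experiment can be sketched as follows. Excluding the two parent domains from the nearest-domain search and using Euclidean distance are assumptions, and the embedding, GO bags and function names are toy placeholders rather than data from this study.

```python
import math


def subtract(u, v):
    """Element-wise difference of two embedding vectors."""
    return tuple(a - b for a, b in zip(u, v))


def nearest_domain(embedding, point, exclude=()):
    """Return the token whose vector lies closest to `point`."""
    candidates = [t for t in embedding if t not in exclude]
    return min(candidates, key=lambda t: math.dist(embedding[t], point))


def overlaps(go_bags, result, parent_a, parent_b):
    """Count GO terms the resulting domain shares with each parent."""
    shared_a = len(go_bags[result] & go_bags[parent_a])
    shared_b = len(go_bags[result] & go_bags[parent_b])
    return shared_a, shared_b


# Toy two-dimensional embedding and GO bags with invented identifiers.
emb = {"A": (2.0, 1.0), "B": (1.0, 1.0), "C": (1.0, 0.1), "D": (5.0, 5.0)}
bags = {
    "A": {"GO:0000001", "GO:0000002"},
    "B": {"GO:0000002"},
    "C": {"GO:0000003"},
    "D": {"GO:0000001"},
}
third = subtract(emb["A"], emb["B"])
result = nearest_domain(emb, third, exclude={"A", "B"})
shared = overlaps(bags, result, "A", "B")
```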
To investigate whether we could find more meaningful movements
in the vector space we looked instead for translations in the
vector space between mutually exclusive binary annotations. King and
Queen are typically used as mutually exclusive labels that straddle
some conceptual binary assignment (ie, gender) and much the same is
true of many GO terms. For instance, in the cellular component ontology
annotation, terms such as intracellular and extracellular might be
viewed as a similar mutually exclusive binary.
We chose three binary cellular component term pairs: intracellular
(GO:0005622) vs extracellular (GO:0005615), nucleus (GO:0005634) vs
cytoplasm (GO:0005737), and cytoplasm (GO:0005737) vs transmembrane
(GO:0009279). For each pairing, we identified proteins with
domains annotated exclusively with one term and not the other term.
Then, for each domain with the first term, we calculated the vector which moves from the
location of that domain to the closest domain annotated
with the second term. As with the prior analysis, not having a detailed
domain ontology prevents us from knowing if this closest domain is the
most appropriate domain to move to. This led to a population of translation
vectors which we could test to measure whether the translation from a
domain with one term to a domain with the other term was always
oriented in a similar direction. We compared all intracellular to extracellular
vectors in an all-against-all fashion and did the same for the other two pairs
of terms (see Figure 6). If the translation is preserved in the vector space,
we would expect all the vectors to have a small angle of deflection
between them. In the transmembrane case, there was no such alignment
and no trend in the angles between the vectors. In both the intracellular
to extracellular and the nucleus to cytoplasm cases, there is a clear distribution
which peaks around 1.5 rad, indicating that in general the translation vectors
are commonly orthogonal to one another and the translation is not preserved in the vector space. This
stands somewhat at odds with the prior observation that vector arithmetic
which encodes semantic translations is a general property of these embeddings.
The caveat to make here is that our embedding may not be of high
enough quality to perform this analysis productively. As noted above, there
may not be enough domain coverage to robustly place the domains in the
embedding space. Alternatively, when choosing the domain pairs, the closest
paired domain may not be the correct domain to calculate the translation
to: either we have selected the wrong extant domain or the correct
domain is yet to be added to Pfam.
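The all-against-all comparison of translation vectors reduces to computing the angle between each pair of vectors. The sketch below assumes the translation vectors have already been computed and are non-zero; the function names and example vectors are hypothetical.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.acos(cosine)


def pairwise_angles(vectors):
    """Angles between every unordered pair of translation vectors."""
    return [
        angle(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]


# Invented translation vectors: the first two are aligned, the third is not.
translations = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
angles = pairwise_angles(translations)
```

Aligned translations give angles near 0 rad, whereas orthogonal translations give angles near pi/2 (about 1.57 rad), which is close to where the observed distributions peak.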
TABLE 5 Comparison of Matthews correlation coefficient (MCC) performance between the first-order Markov encoding and the Word2vec embedding of the domain corpus

k-Nearest neighbors     Mean MCC (Word2vec)     Mean MCC (Markov)
1                       0.28                    0.13
3                       0.28                    0.14
5                       0.26                    0.14
10                      0.23                    0.11

However, the intracellular to extracellular histogram shows a small
leading tail below 1 rad (see Figure 7), indicative of a small population
of vectors which do approach alignment. Indeed, we are able to
find small numbers of genes in InterPro which share Pfam domains
and where the difference is a substitution of one or more intracellular
annotated domains for extracellular domains. Two examples are
G3I6X9 (solute carrier family 25 member 46) and A0A0L6WZ71 (glycogen
debranching enzyme), and I3L0A0 (Human Transcript
TMEM189-UBE2BV1) and G7Y5H3 (Ubiquitin-conjugating enzyme
E2 L3); see Figure 8. The first pair, G3I6X9 and A0A0L6WZ71, have
respectively extracellular and intracellular functions. In the second pair,
G7Y5H3 has a cytoplasmic function but it is less clear what the role
of I3L0A0 might be. The fact that this appears to work in some limited
cases may suggest that an embedding based on a dataset with much
greater domain coverage might be more accurate.
3.5 | Domains of unknown function
As the Word2vec embedding has some meaningful structure with
regard to GO term inheritance, we can also use a nearest-neighbor
approach to suggest putative sets of GO terms for the functions that each eukaryotic
Pfam domain of unknown function (Pfam DUF) may take part in. This
allows a homology-free way to estimate GO assignments. Our corpus
of eukaryotic genes contained annotations from 3918 DUFs. Using a
single nearest-neighbor inheritance method, 1292 of these domains
could be assigned new GO terms (ie, their nearest neighbor in the
embedding was annotated and was not a gap or other sequence
region). On average each DUF gets 11 novel GO terms assigned.
Surveying the GO assignments, we note that the mean ontology
depth of the assigned terms (ie, the smallest number of steps from
an assigned term to the root of the ontology) is
4.9 steps. The distribution of assigned
term depths is also somewhat positively skewed (data not shown).
FIGURE 5 Example demonstrating semantically meaningful vector algebra. In A, four terms are placed in the vector space. If we subtract the
Man vector from King (B), we move to an undefined point in the vector space. Adding the Woman vector (C) moves to the Queen vector
[Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 6 Comparing translation vectors from one binary Gene Ontology property to another. A, Putative vector embedding of intracellular
(blue dots) and extracellular (orange crosses) labeled domains. B, Vectors which translate each intracellular domain to its closest extracellular
labeled domain. C, Vectors are extracted and pooled. D, The angle between each pair of vectors is compared to find vectors that point in the same direction
[Color figure can be viewed at wileyonlinelibrary.com]

FIGURE 7 Histogram of transformation vector angles for intracellular to extracellular translations. Horizontal axis: radians; vertical axis: count (0 to 2e+05)
The BP, MF, and CC ontologies have maximum depths of 16, 16, and
11, respectively. This indicates that the typical term assignments are
somewhat general, closer to generic terms such as “protein binding”
than to terms which indicate explicit functional roles, such as catalytic
mechanisms. In Figure 9, the distribution of terms indicates that
the majority of DUFs receive only a handful of putative GO assignments.
We suggest that such assignments could be used as general
starting points for Pfam domain annotations and, with relatively few
terms to confirm in most cases, these should not make such annotation tasks
more onerous or obfuscated. We make these annotations available
(see Appendix S1) and note they could make a starting point for future
annotation of these domains in Pfam.
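The single nearest-neighbor inheritance for DUFs can be sketched as below. The token names for disordered and coiled-coil regions, the removal of terms already in the DUF's own bag, the function name and all example identifiers are assumptions made for illustration.

```python
# Gap and sequence-region tokens; "Disordered" and "CoiledCoil" are assumed names.
NON_DOMAIN_TOKENS = {
    "G100", "G200", "G300", "G400", "G500",
    "LowComplexity", "Disordered", "CoiledCoil",
}


def suggest_go_terms(duf, nearest, go_bags):
    """Inherit novel GO terms from a DUF's single nearest neighbor."""
    neighbour = nearest[duf]
    if neighbour in NON_DOMAIN_TOKENS:
        return set()  # gaps and other sequence regions carry no annotation
    already_known = go_bags.get(duf, set())
    return go_bags.get(neighbour, set()) - already_known


# Hypothetical nearest-neighbor lookup and GO bags.
nearest = {"DUF_A": "PF00001", "DUF_B": "G100", "DUF_C": "PF00002"}
go_bags = {
    "PF00001": {"GO:0000001", "GO:0000002"},
    "DUF_A": {"GO:0000001"},
}
suggested = {duf: suggest_go_terms(duf, nearest, go_bags) for duf in nearest}
```

A DUF receives no suggestion when its nearest neighbor is a gap or sequence-region token, or when that neighbor is itself unannotated.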
4 | DISCUSSION
Applying Word2vec to protein domains, making the assumption that
multi-domain proteins are sentence-like, reveals that domains display
some manner of semantic or lexical structure. Given this, it should be
possible in future to elucidate statistical or semantic rules for domain
placement in multi-domain proteins using grammatical inference
methods. This would have applications in protein design and
modeling.
The Word2vec algorithm was designed to work over very large
corpora of human language, and while the 9 million eukaryotic InterPro
sequences used in this study are a relatively large corpus, the corpus
of “sentences” currently has too sparse a level of GO annotation
to allow us to develop a high-quality embedding of word-tokens
which maps well to GO-term-defined function. A further limitation lies
in the amount of domain coverage. Nearly 70% of the proteins
remain unassigned to domains, and without greater domain coverage a
truly robust domain embedding may not be possible. Additionally,
multi-domain proteins typically have fewer than six domains, and
often just two or three, whereas human sentences comprise longer
sequences. This may mean sequential sets of domains are unlikely to
provide sufficient contextual information to produce an informative
vector embedding. All these issues might be addressed by retuning
the Word2vec model to make it more appropriate for domain data.
Word2vec offers several trainable parameters which may allow the
method to be adapted for better performance with protein domains;
however, it may be the case that an entirely different architecture will
be needed.
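To make the context-length limitation concrete, the following sketch counts the (target, context) pairs that a skip-gram style model would see for a single "sentence". It is an illustration under stated assumptions, not the training code used in this study: the function `skipgram_pairs`, the fixed symmetric window of 5 and the placeholder tokens `PF_A`, `PF_B` and `PF_C` are hypothetical. Word2vec implementations commonly sample a smaller window for each target, so the counts here should be read as upper bounds.

```python
def skipgram_pairs(sentence, window=5):
    """List (target, context) token pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((target, sentence[j]))
    return pairs

# Hypothetical domain architectures written as token "sentences".
two_domain = ["PF_A", "PF_B"]
three_domain = ["PF_A", "PF_B", "PF_C"]
# A short natural-language sentence for comparison.
english = "the quick brown fox jumps over the lazy dog".split()

n_two = len(skipgram_pairs(two_domain))      # 2 pairs
n_three = len(skipgram_pairs(three_domain))  # 6 pairs
n_english = len(skipgram_pairs(english))
```

A two-domain protein yields only two training pairs whatever the window size, which illustrates why short domain architectures may carry little contextual information.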
FIGURE 8 Diagram of intra/extra-cellular domain swaps. Both proteins share Pfam domain PF00179. In protein I3L0A0 domain PF10520 has
been assigned the Gene Ontology (GO) extracellular GO term (GO:0005615). In protein G7Y5H3 the substituted domains, PF014699 and
PF14701, are both labeled with the intracellular GO term (GO:0005622) [Color figure can be viewed at wileyonlinelibrary.com]
FIGURE 9 Frequency of the number of Gene Ontology terms assigned to domains of unknown function (x-axis: Number of GO terms assigned; y-axis: Frequency)
Using GO annotations to annotate domains is necessarily noisy. It
is not clear that they are the best way to encode the lexical “meaning”
of an isolated domain in its multi-domain context. In future, a
finer-grained annotation of domains' sub-functional roles will be necessary
to correctly interpret the lexical meaning of arithmetic transformations
of vectors in the embedding space. Nevertheless, this work does
open up the tantalizing possibility that protein domains have contextual
lexical meaning that could be learned and, in turn, used to
derive rules for multi-domain protein evolution. Moreover, even in light
of these limitations, the vector embedding allows us to suggest preliminary
functional roles for many as-yet-unannotated Pfam domains and,
combined with other sources of functional information, this could help
improve our overall ability to assign functions to proteins and the
genes which encode them.
4.1 | Code and data
All code is available on GitHub, and the domain assignments, gensim
model, token distance matrix and DUF assignments are available via
our webserver:
https://github.com/psipred/domain_word2vec_scripts
https://bioinfadmin.cs.ucl.ac.uk/downloads/word2vec/.
ORCID
Daniel W. A. Buchan https://orcid.org/0000-0001-7391-4696
REFERENCES
1. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word
representations in vector space. arXiv:1301.3781v1. 2013.
2. Goldberg Y, Levy O. word2vec Explained: deriving Mikolov et al.'s
negative-sampling word-embedding method. arXiv:1402.3722. 2014.
3. Asgari E, Mofrad MR. Continuous distributed representation of biological
sequences for deep proteomics and genomics. PLoS One.
2015;10(11):e0141287.
4. Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings
for machine learning. Bioinformatics. 2018;34(15):2642-2648.
5. Viehweger A, Krautwurst S, Parks DH, König B, Marz M. An encoding
of genome content for machine learning. bioRxiv. 2019.
https://doi.org/10.1101/524280.
6. Finn RD, Coggill P, Eberhardt RY, et al. The Pfam protein families
database: towards a more sustainable future. Nucleic Acids Res. 2016;
44(D1):D279-D285.
7. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. SCOP2
prototype: a new approach to protein structure mining. Nucleic Acids
Res. 2014;42(Database issue):D310-D314.
8. Cheng H, Schaeffer RD, Liao Y, et al. ECOD: an evolutionary classification
of protein domains. PLoS Comput Biol. 2014;10(12):e1003926.
9. Dawson NL, Lewis TE, Das S, et al. CATH: an expanded resource to
predict protein function through structure and sequence. Nucleic
Acids Res. 2017;45(D1):D289-D295.
10. Das S, Orengo CA. Protein function annotation using protein domain
family resources. Methods. 2016;93:24-34.
11. Nepomnyachiy S, Ben-Tal N, Kolodny R. Complex evolutionary footprints
revealed in an analysis of reused protein segments of diverse
lengths. Proc Natl Acad Sci U S A. 2017;114(44):11703-11708.
12. Friedberg I. Automated protein function prediction—the genomic
challenge. Brief Bioinform. 2006;7(3):225-242.
13. Watson JD, Laskowski RA, Thornton JM. Predicting protein function from
sequence and structural data. Curr Opin Struct Biol. 2005;15(3):275-284.
14. Loewenstein Y, Raimondo D, Redfern OC, et al. Protein function
annotation by homology-based inference. Genome Biol. 2009;10
(2):207.
15. Radivojac P, Clark WT, Oron TR, et al. A large-scale evaluation of
computational protein function prediction. Nat Methods. 2013;10(3):
221-227.
16. Consortium GO. Expansion of the gene ontology knowledgebase and
resources. Nucleic Acids Res. 2017;45(D1):D331-D338.
17. Cozzetto D, Buchan DWA, Bryson K, Jones DT. Protein function prediction
by massive integration of evolutionary analyses and multiple
data sources. BMC Bioinformatics. 2013;14(Suppl 3):S1.
18. Lan L et al. MS-kNN: protein function prediction by integrating multiple
data sources. BMC Bioinformatics. 2013;14(Suppl 3):S8.
19. Goldberg T, Hecht M, Hamp T, et al. LocTree3 prediction of localization.
Nucleic Acids Res. 2014;42(Web Server issue):W350-W355.
20. Khan IK, Wei Q, Chapman S, KC DB, Kihara D. The PFP and ESG protein
function prediction methods in 2014: effect of database updates
and ensemble approaches. Gigascience. 2015;4:43.
21. Almeida-e-Silva DC, Vencio RZ. SIFTER-T: a scalable and optimized
framework for the SIFTER phylogenomic method of probabilistic protein
domain annotation. Biotechniques. 2015;58(3):140-142.
22. Van Landeghem S et al. Exploring biomolecular literature with EVEX:
connecting genes through events, homology, and indirect associations.
Adv Bioinformatics. 2012;2012:582765.
23. Falda M et al. Argot2: a large scale function prediction tool relying on
semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics.
2012;13(Suppl 4):S14.
24. Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional
classification of CATH superfamilies: a domain-based approach for
protein function annotation. Bioinformatics. 2016;32(18):2889.
25. Fang H, Gough J. A domain-centric solution to functional genomics
via dcGO predictor. BMC Bioinformatics. 2013;14(Suppl 3):S9.
26. Finn RD, Attwood TK, Babbitt PC, et al. InterPro in 2017-beyond protein
family and domain annotations. Nucleic Acids Res. 2017;45(D1):
D190-D199.
27. Walsh I, Giollo M, di Domenico T, Ferrari C, Zimmermann O,
Tosatto SCE. Comprehensive large-scale assessment of intrinsic protein
disorder. Bioinformatics. 2015;31(2):201-208.
28. Jones DT. Protein secondary structure prediction based on position-specific
scoring matrices. J Mol Biol. 1999;292(2):195-202.
How to cite this article: Buchan DWA, Jones DT. Learning a
functional grammar of protein domains using natural language
word embedding techniques. Proteins. 2020;88:616–624.
https://doi.org/10.1002/prot.25842