GlimmerM, Exonomy and Unveil: three
ab initio eukaryotic genefinders
William H. Majoros*, Mihaela Pertea, Corina Antonescu and Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
Received February 12, 2003; Revised and Accepted March 20, 2003
ABSTRACT
We present three programs for ab initio gene
prediction in eukaryotes: Exonomy, Unveil and
GlimmerM. Exonomy is a 23-state Generalized
Hidden Markov Model (GHMM), Unveil is a 283-state
standard Hidden Markov Model (HMM) and GlimmerM
is a previously-described genefinder which utilizes
decision trees and Interpolated Markov Models
(IMMs). All three are readily re-trainable for new
organisms and have been found to perform well
compared to other genefinders. Results are presented
for Arabidopsis thaliana. Cases have been
found where each of the genefinders outperforms
each of the others, demonstrating the collective
value of this ensemble of genefinders. These programs
are all accessible through webservers at http://
www.tigr.org/software.
INTRODUCTION
Accurate gene structure prediction remains an important
component of genomic annotation efforts, notwithstanding
the increased availability, at least for some organisms, of handcrafted
annotations. Even though ab initio predictions are
typically the least trusted form of evidence used by human
annotators, they are necessary for ensuring high levels of
sensitivity, in some cases being the only form of evidence for a
gene that might otherwise be completely overlooked (1,2).
Unfortunately, genefinder programs for eukaryotes are far
from perfect, often predicting genes or exons which do not
exist, failing to predict those that do exist or generating
predictions having one or more incorrect exon boundaries.
Furthermore, different genefinders trained for the same
organism often produce different predictions. Thus, while a
genefinder is a useful component of the annotation process, it
is difficult to choose a single genefinder to use and genome
annotation systems often employ multiple genefinders in order
to increase the number of predicted coding segments (CDSs).
We present here three genefinder programs, each with
different strengths due to their different algorithmic designs.
While all three programs achieved high levels of accuracy in
our controlled tests, the programs sometimes do not agree on
the structure of a given gene, with each of the programs
occasionally producing a better prediction than the other two.
For this reason, we believe that annotators will appreciate
having access to all three programs for use in their annotation
efforts.
All three programs can be run directly on genomic sequences
by using the http interface at TIGR, available from http://
www.tigr.org/software. This web interface will be useful to
laboratories lacking the computational facilities for running
our UNIX-based software themselves and to those working on
small-scale projects, such as those for sequencing an individual
BAC or specific region of a genome. Sequences <30 kb can be
pasted directly into the browser, whereas larger sequences
<200 kb can be uploaded from a user's system as a FASTA
file. The resulting gene predictions can be displayed directly in
the browser or emailed to the user. GlimmerM is also freely
available for direct download, including source code that
permits re-training the system on any species. (The user must
collect training data to feed to the system.) Exonomy and
Unveil will be made freely available as open source software in
the near future.
The GlimmerM web server (http://www.tigr.org/tdb/glimmerm/
glmr_form.html) can provide gene predictions for several
organisms, including Plasmodium falciparum, P.yoelii, Oryza
sativa (rice), Aspergillus fumigatus, Arabidopsis thaliana,
Theileria parva and Schistosoma mansoni. The Exonomy
and Unveil web servers, at http://www.tigr.org/tdb/Exonomy/
exonomy.html and http://www.tigr.org/tdb/Unveil/unveil.html,
respectively, were initially trained on A.thaliana, but other
organisms will be added in the very near future and will probably
be available at the time of this publication. On the web page, an
organism can be selected using a drop-down list.
DESIGN AND IMPLEMENTATION
All three genefinders utilize statistical models trained from
known genes to evaluate each gene structure. They employ
dynamic programming techniques to search all possible
gene structures in a DNA sequence and efficiently find
the most probable structure, given the statistical model of the
genefinder.
Each genefinder is based on one or more types of Markov
model. The simplest is the nth-order Markov chain (MC),
which calculates the conditional probability of a base, given
*To whom correspondence should be addressed. Tel: +1 301 838 0208; Fax: +1 301 610 5985; Email: bmajoros@tigr.org
# 2003 Oxford University Press Nucleic Acids Research, 2003, Vol. 31, No. 13 3601­3604
DOI: 10.1093/nar/gkg527
Nucleic Acids Research, Vol. 31, No. 13 # Oxford University Press 2003; all rights reserved
the preceding n bases and then multiplies the conditional
probabilities together to arrive at a probability for an entire
subsequence. A variant of this scheme is the Interpolated
Markov Model (IMM), which interpolates between models of
different order, based on the amount of evidence available for a
given probability estimate (3). Nonstationary Markov chains
(NSMC) utilize two or more MCs in strict order while allowing
the individual chains making up the NSMC to partition the
sequence in any way to maximize the joint probability. Threeperiodic
Markov Chains (3PMC) utilize three MCs in a cycle,
one for each frame within a putative coding region. Hidden
Markov Models (HMM) are state-based models in which
each state emits a single base with fixed probability and
then transitions to another state with fixed probability. A
Generalized Hidden Markov Model (GHMM) is an HMM in
which each state can emit a sequence of bases rather than a
single base and which allows explicit modeling of exon and
intron lengths.
Models such as these, when used to evaluate variable-length
DNA sequences, are referred to as content sensors. Other
content sensors that we employ include codon bias models and
decision trees. For fixed-length features such as splice sites and
start/stop codons we employ several types of signal sensor,
including Weight Matrices (WMM), also called Position
Weight Matrices (PWM), Weight Array Matrices (WAM),
Windowed Weight Array Matrices (WWAM), MCs and
Maximal Dependence Decomposition (MDD) trees. Detailed
treatments of content and signal sensors can be found in the
literature (4­6).
Some general features of our three genefinders are as
follows:
 They can be configured to predict only complete gene
structures or to allow partial genes.
 They are retrainable on new organisms without needing to
recompile the program source code.
 They predict genes on either or both strands during a single
pass over the DNA sequence.
 They do not attempt to predict UTRs (untranslated regions),
though some of them model UTRs internally.
 They currently generate a single structure for each predicted
gene (no alternative splicing capability).
GlimmerM
The basis of GlimmerM is a dynamic programming
algorithm that considers all combinations of possible exons
for inclusion in a gene model (by gene model we mean
the protein coding portion of a gene or CDS) and chooses
the best of all these combinations by using IMMs trained
on complete coding regions (7). The decision about what
gene model is best is a combination of the `strength' of the
splice sites and that of the potential coding regions, as we
describe below.
The splice site predictor algorithm in GlimmerM (8) first
uses MDD (9) and first-order MCs to capture dependencies
among neighboring bases in a small window around each
splice junction (16 and 29 bp around the donor and acceptor
sites, respectively). The algorithm then takes advantage of the
fact that the coding and non-coding sequences switch at the
splice junction and tries to detect this switch with two secondorder
MCs, one which models coding sequence and another
that models non-coding sequence. The length of each of these
coding or non-coding context windows is currently fixed at
80 bp. By keeping only those splice sites having the maximal
score within a 60 bp window, many false positives are
eliminated.
Potential coding regions are evaluated by a scoring function
based on decision trees that estimate the probability that a
DNA subsequence is coding. Subsequences are evaluated
according to their putative type: intron, initial exon,
internal exon, final exon and single-exon gene. Each such
subsequence is run through 10 different decision trees built
with the OC1 system (10). The probabilities obtained with
the decision trees are averaged to produce a smoothed
estimate of the probability that the given subsequence is of a
certain type.
A putative gene model is then accepted only if the IMM
score for the coding sequence in the correct reading frame
exceeds a fixed threshold.
Unveil
Unveil is a 283-state HMM based on the VEIL design (11). In
our implementation the actual topology and number of states in
the HMM are specified in a configuration file to afford some
flexibility for future enhancements. When the model is loaded
at run time, the program automatically reverse-complements
this model to allow simultaneous prediction on the reverse
strand. The overall structure currently in use is depicted
schematically in Figure 1A. The 141 states comprising the
forward-strand HMM are shown as being grouped into eight
distinct submodels, each of which can be trained individually
using the Baum­Welch EM algorithm (12) and automatically
combined into a composite HMM. Thenceforth the HMM is
treated as a single, complete model for the purpose of optimal
path analysis.
Figure 1. Simplified structure of two Markov model genefinders: (A) Unveil
and (B) Exonomy.
3602 Nucleic Acids Research, 2003, Vol. 31, No. 13
Once the HMM and the subject sequence are loaded, the
program generates a single set of non-overlapping gene
structure predictions by applying the Viterbi algorithm (13) to
find the most probable path through the states of the HMM
that would emit the subject sequence, given the HMM's
emission and transition probabilities. Bases emitted by coding
states according to this optimal path are grouped into exons
and similarly for intron states and introns, allowing the exons
to then be grouped into genes. Coding frames are explicitly
tracked to ensure that exons within a gene are in the same
open reading frame. Viterbi decoding is achieved via an
efficient dynamic programming implementation on a sparse
graph.
The HMM currently in use was designed to model 50
and 30
UTRs, hexamer frequencies in introns, 7 bp consensus regions
around splice sites, noncanonical start/stop codons and frame
effects within exons. HMM states matching putative exons do
not differentiate between the different exon types (e.g. initial,
final).
The output format is GFF (General Feature Format, see
http://www.sanger.ac.uk/Software/GFF).
Exonomy
Exonomy is a 23-state GHMM similar to Genscan (9) and
Genie (14). The generalized nature of the GHMM framework
provides greater flexibility over an HMM for tuning and
augmenting the genefinder, because entire gene features such
as exons and splice junctions are handled by individual states
in the GHMM, which can be independently retrained,
reparameterized, or replaced by other types of model to
improve performance, without needing to recompile the
program's source code.
For its content-sensor states Exonomy provides MCs, IMMs,
3PMCs, NSMCs and codon bias estimation. For its signal
sensors it provides WMMs, WAMs, WWAMs, MCs and MDD
trees having any of the former as leaf models. Splice sites and
start and stop codons are identified using MDD trees and
variable-length context sequences, as in GlimmerM. Currently,
the overall state architecture of Exonomy is fixed into a
23-state topology which includes submodels for introns,
intergenic segments, single-exon genes and initial-, internaland
final-exons (Fig. 1B).
Exonomy begins by identifying possible splice sites and start
and stop codons. It then constructs a graph containing a vertex
for each splice site or start/stop codon and an edge for each
exon, intron or intergenic segment. Frame constraints are
applied to reduce the size of the graph and then submodels
specified in a configuration file are dynamically loaded and
applied to evaluate the individual vertices and edges in the
graph. A dynamic-programming algorithm is then used to
rapidly find the highest-probability path, which corresponds to
the optimal gene model. Exonomy's output format is GFF.
RESULTS AND DISCUSSION
Prediction accuracy of our genefinders was assessed by
training and testing all three programs on a large set of fulllength
cDNAs from A.thaliana, all of which have been mapped
to the genome (15). The programs were all trained on the same
set of 8500 cDNAs and then tested on 300 sequences each
containing one gene (also from the cDNA dataset) plus a
margin of 100 bp on either side of the start and stop codons.
This rather artificial test scenario is not intended to serve as a
basis for a comprehensive comparison of genefinders but
merely to demonstrate accuracy comparable to that of more
established programs.
Predictions were scored based on nucleotide accuracy (how
accurately each base was classified as coding versus noncoding);
sensitivity and specificity of exons predicted exactly;
and percentage of genes predicted exactly (i.e. consisting
entirely of exactly predicted exons). We also provide results of
running a version of Genscan trained for A.thaliana (16) on the
same test set for comparison. Because we were not able to
retrain Genscan on the training set used by the other programs,
the comparison is imperfect. Results are shown in Table 1.
Table 1 shows that there is much room for improvement by all
four genefinders. Not only was there variation in performance
among the different programs, but the variation was different
for each of the four measures of accuracy. In terms of numbers
of perfectly correct gene predictions, Unveil exhibited the best
performance and Genscan the worst, whereas according to exon
sensitivity and specificity, Genscan performed the best,
followed by Unveil. More importantly, the numbers show that
all three of our genefinders perform quite respectably according
to one or more measures of accuracy. More detailed comparison
of the predictions on specific test sequences reveals that there
are many cases of each of our genefinders outperforming the
other three, as exemplified in Figures 2­4. Thus, were any of
these genefinders to be excluded from an annotation system,
overall ab initio prediction accuracy could be expected to suffer
as a result.
It should be noted that the four accuracy measures reported
involve very different levels of discrimination, as the baseline
nucleotide accuracy is expected to be very high for an
Table 1. Accuracy comparison among four genefinders for 300 genes based
on full-length A.thaliana cDNAs. Genscan was not trained on the same
training set
Program Nucleotide
accuracy
(%)
Exon
specificity
(%)
Exon
sensitivity
(%)
Percentage
of exact
genes (%)
Unveil 94 75 74 46
Exonomy 95 63 61 42
GlimmerM 93 71 71 44
Genscan 94 80 75 27
Figure 2. An example in which Exonomy produces the correct gene model.
Nucleic Acids Research, 2003, Vol. 31, No. 13 3603
organism with relatively low gene density, and as percentage of
exact genes is very stringent (i.e. nearly-correct gene structures
and wildly erroneous ones are all treated simply as errors).
Furthermore, tradeoffs between these measures are possible, as
evidenced by Genscan's high exon accuracy and rather lower
gene accuracy, which situation can readily occur depending on
the grouping of correct and incorrect exons. Nonetheless,
Figures 2­4 demonstrate that imperfect gene predictions often
are very near to being correct.
Note also that due to the possible existence of alternatively
spliced forms in our test and training sets, we cannot be certain
that some of those predictions which were classified as errors
are not actually valid alternative transcripts, though this
consideration should only strengthen the case for continuing
to employ multiple genefinders, in addition to homology
evidence, in any annotation project. Nevertheless, the need for
additional improvements to genefinding techniques is clear.
Our current research includes methods for combining
genefinder predictions using various techniques, such as
machine learning and multiobjective optimization, together
with sequence homology and synteny information.
ACKNOWLEDGEMENTS
This work was supported in part by NSF grants MCB-0114792
and KDI-9980088, and by NIH grant R01-LM06845. The
authors thank Jennifer Wortman, Mark Yandell and Art
Delcher for helpful discussions and the anonymous reviewers
for useful comments.
REFERENCES
1. Das,M., Burge,C.B., Park,E., Colinas,J. and Pellitier,J. (2001)
Assessment of the total number of human transcription units. Genomics,
77, 71­78.
2. Guigo,R., Dermitzakis,E.T., Agarwal,P., Ponting,C.P., Parra,G.,
Reymond,A., Abril,J.F., Keibler,E., Lyle,R., Ucla,C., Antonarakis,S.T.
and Brent,M.R. (2003) Comparison of mouse and human genomes
followed by experimental verification yields an estimated 1019 additional
genes. Proc. Natl Acad. Sci. USA, 100, 1140­1145.
3. Salzberg,S.L., Delcher,A.L., Kasif,S. and White,O. (1998) Microbial
gene identification using interpolated Markov models. Nucleic Acids Res.,
26, 544­548.
4. Salzberg,S.L., Searls,D.B. and Kasif,S. (eds). (1998) Computational
Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands.
5. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological
Sequence Analysis. Cambridge University Press, Cambridge, UK.
6. Mathé,C., Sagot,M.-F., Schiex,T. and Rouzé,P. (2002) Current methods
of gene prediction, their strengths and weaknesses. Nucleic Acids
Res., 30, 4103­4117.
7. Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. and Tettelin,H.
(1999) Interpolated Markov models for eukaryotic gene finding.
Genomics, 59, 24­31.
8. Pertea,M., Lin,X. and Salzberg,S.L. (2001) GeneSplicer: a new
computational method for splice site prediction. Nucleic Acids Res.,
29, 1185­1190.
9. Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in
human genomic DNA. J. Mol. Biol., 268, 78­94.
10. Murthy,S.K., Kasif,S. and Salzberg,S.L. (1994) A system for induction of
oblique decision trees. J. Artificial Intelligence Res., 2, 1­32.
11. Henderson,J., Salzberg,S. and Fasman,K.H. (1997) Finding
genes in DNA with a Hidden Markov Model. J. Comput. Biol.,
4, 127­141.
12. Baum,L.E., Petrie,T., Soules,G. and Weiss,N. (1970) A maximization
technique occurring in the statistical analysis of probabilistic functions of
Markov chains. Ann. Math. Stat., 41, 164­171.
13. Viterbi,A.J. (1967) Error bounds for convolutional codes and an
asymptotically optimal decoding algorithm. IEEE Trans. Information
Theory, 13, 260­269.
14. Reese,M.G., Kulp,D., Tammana,H. and Haussler,D. (2000) Genie--gene
finding in Drosophila melanogaster. Genome Res., 4, 529­538.
15. Haas,B.J., Volfovsky,N., Town,C.D., Troukhan,M., Alexandrov,N.,
Feldmann,K.A., Flavell,R.B., White,O. and Salzberg,S.L. (2002)
Full-length messenger RNA sequences greatly improve genome
annotation. Genome Biol., 3, 1­12.
16. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence
of the flowering plant Arabidopsis thaliana. Nature, 408, 796­815.
Figure 3. An example in which Unveil produces the correct gene model (as
does Genscan).
Figure 4. An example in which GlimmerM produces the correct gene model.
3604 Nucleic Acids Research, 2003, Vol. 31, No. 13