Explanation
Gn.Ex : gene number, exon number (for reference)
Type : Init = Initial exon (ATG to 5' splice site)
       Intr = Internal exon (3' splice site to 5' splice site)
       Term = Terminal exon (3' splice site to stop codon)
       Sngl = Single-exon gene (ATG to stop)
       Prom = Promoter (TATA box / initiation site)
       PlyA = poly-A signal (consensus: AATAAA)
S : DNA strand (+ = input strand; - = opposite strand)
Begin : beginning of exon or signal (numbered on input strand)
End : end point of exon or signal (numbered on input strand)
Len : length of exon or signal (bp)
Fr : reading frame (a forward strand codon ending at x has frame x mod 3). For
example, if nucleotides 1,2,3 of the sequence are read as a codon, that's called
reading frame 0. If 2,3,4 are read as a codon, that's reading frame 1. If 3,4,5 are
read as a codon, that's reading frame 2, and so on. This information, together with
the starting and ending positions of the exon, is sufficient to give the amino acid
sequence encoded by the exon. Another use of the reading frame: if two adjacent
predicted exons share the same reading frame and are separated by a relatively
short intron, it may be worth considering that the intervening intron is not
correct, i.e. that the two exons plus the intervening intron might form one long
exon (assuming there are no in-frame stops in the intron, of course).
Ph : net phase of exon (exon length modulo 3). For example, an exon of length 15 bp
has net phase 0 since 15 is divisible by 3, an exon of length 16 bp has net phase 1
because 16 divided by 3 leaves a remainder of 1, an exon of length 17 bp has net
phase 2, and an exon of length 18 bp has net phase 0 again. The point of this is
that exons whose net phase is 0 can be omitted from the gene without disrupting
the reading frame: such exons are candidates for being either 1) incorrect, or
2) alternatively spliced.
I/Ac : initiation signal or 3' splice site score (tenth bit units; x 10). If below
zero, probably not a real acceptor site.
Do/T : 5' splice site or termination signal score (tenth bit units; x 10). If below
zero, probably not a real donor site.
CodRg : coding region score (tenth bit units)
P : probability of exon (sum over all parses containing exon). This quantity is
close to the actual probability that the predicted exon is correct.
Tscr : exon score (depends on length, I/Ac, Do/T and CodRg scores).
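The Fr and Ph definitions above reduce to simple modular arithmetic, which the following minimal sketch illustrates. The function names are hypothetical and are not part of GENSCAN. The stop-codon check is only an assumed way of testing the "no in-frame stops in the intron" condition on the forward strand, using 1-based coordinates on the input strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reading_frame(codon_end: int) -> int:
    """Frame of a forward-strand codon whose last base is at 1-based position codon_end."""
    return codon_end % 3


def net_phase(exon_length: int) -> int:
    """Net phase of an exon: its length in bp modulo 3."""
    return exon_length % 3


def has_inframe_stop(seq: str, begin: int, end: int, frame: int) -> bool:
    """Return True if the region begin..end (1-based, inclusive) contains a stop
    codon in the given frame, i.e. a stop codon ending at x with x % 3 == frame.
    Forward strand only; an illustrative helper, not GENSCAN's own logic."""
    seq = seq.upper()
    for x in range(begin + 2, end + 1):
        # The codon ending at 1-based position x is the 0-based slice [x-3, x).
        if x % 3 == frame and seq[x - 3:x] in STOP_CODONS:
            return True
    return False
```

For instance, `reading_frame(3)`, `reading_frame(4)` and `reading_frame(5)` give frames 0, 1 and 2, matching the codons 1,2,3 / 2,3,4 / 3,4,5 in the text, and `net_phase` gives 0, 1, 2, 0 for lengths 15, 16, 17 and 18.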
Comments
The SCORE of a predicted feature (e.g., exon or splice site) is a log-odds
measure of the quality of the feature based on local sequence properties.
For example, a predicted 5' splice site with score > 100 is strong; 50-100 is
moderate; 0-50 is weak; and below 0 is poor (more than likely not a real donor
site). The PROBABILITY of a predicted exon is the estimated probability under
GENSCAN's model of genomic sequence structure that the exon is correct. This
probability depends in general on global as well as local sequence properties, e.g.,
it depends on how well the exon fits with neighboring exons. It has been shown
that predicted exons with higher probabilities are more likely to be correct than
those with lower probabilities.
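The score bands quoted above for a 5' splice site can be written as a small lookup. This is a sketch only: the function name is hypothetical, and the handling of the exact boundary values (50 and 100 counted as moderate, 0 as weak) is an assumption, since the text does not say which band the boundaries belong to.

```python
def classify_donor_score(score: float) -> str:
    """Map a 5' splice site score (tenth-bit units) to the qualitative bands
    given in the text. Boundary handling is an assumption."""
    if score > 100:
        return "strong"
    if score >= 50:
        return "moderate"
    if score >= 0:
        return "weak"
    return "poor"  # more than likely not a real donor site
```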
What are the suboptimal exons?
Under the probabilistic model of gene structural and compositional properties
used by GENSCAN, each possible "parse" (gene structure description) which is
compatible with the sequence is assigned a probability. The default output of the
program is simply the "optimal" (highest probability) parse of the sequence. The
exons in this optimal parse are referred to as "optimal exons" and the translation
products of the corresponding "optimal genes" are printed as GENSCAN predicted
peptides. (All the data in our J Mol Biol paper and on the other GENSCAN web
pages refer exclusively to the optimal parse/optimal exons.) Of course, the
optimal parse does not always correspond to the actual (biological) parse of the
sequence, that is, the actual set of exons/genes present. In addition, there may be
more than one parse which can be considered "correct", for example, in the case
of a gene which is alternatively transcribed, translated or spliced. For both of
these reasons, it may be of interest to consider "suboptimal" ("near-optimal")
exons as well, i.e. exons which have reasonably high probability but are not
present in the optimal parse. Specifically, for every potential exon E in the
sequence, the probability P(E) is defined as the sum of the probabilities under the
model of all possible "parses" (gene structures) which contain the exact exon E in
the correct reading frame. (This quantity is calculated as described on the
GENSCAN exon probability page.) Given a probability cutoff C, suboptimal exons
are those potential exons with P(E) > C which are not present in the optimal parse.
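The definition of P(E) and of suboptimal exons can be illustrated on a toy example. The sketch below explicitly enumerates a handful of made-up parses, each with a probability and a set of exons written as (begin, end, frame) tuples. All names and numbers are hypothetical, and explicit enumeration is used here purely for illustration; it should not be read as a description of how GENSCAN computes these quantities.

```python
from collections import defaultdict

# Toy exons as (begin, end, frame); coordinates are made up for illustration.
A, B, C, D = (100, 250, 1), (400, 520, 2), (410, 520, 2), (700, 790, 0)

# Toy parses: (probability of parse, exons contained in that parse).
TOY_PARSES = [
    (0.50, frozenset({A, B})),
    (0.30, frozenset({A, C})),
    (0.15, frozenset({C})),
    (0.05, frozenset({D})),
]


def exon_probabilities(parses):
    """P(E) = sum of the probabilities of all parses containing exactly exon E."""
    probs = defaultdict(float)
    for parse_prob, exons in parses:
        for exon in exons:
            probs[exon] += parse_prob
    return dict(probs)


def suboptimal_exons(parses, cutoff):
    """Exons with P(E) > cutoff that are absent from the highest-probability parse."""
    _, optimal = max(parses, key=lambda item: item[0])
    probs = exon_probabilities(parses)
    return {e: p for e, p in probs.items() if p > cutoff and e not in optimal}
```

In this toy data the optimal parse contains A and B. Exon C appears only in lower-probability parses but accumulates P(C) of about 0.45, so it is reported as suboptimal at a cutoff of 0.10, while D (about 0.05) is not.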
Suboptimal exons have a variety of potential uses. First, suboptimal exons
sometimes correspond to real exons which were missed for whatever reason by
the optimal parse of the sequence. Second, regions of a prediction which contain
multiple overlapping and/or incompatible optimal and suboptimal exons may in
some cases indicate alternatively spliced regions of a gene (Burge & Karlin, in
preparation). The probability cutoff C used to determine which potential exons
qualify as suboptimal exons can be set to any of a range of values between 0.01
and 1.00. The default value on the web page is 1.00, meaning that no suboptimal
exons are printed. For most applications, a cutoff value of about 0.10 is
recommended. Setting the value much lower than 0.10 will often lead to an
explosion in the number of suboptimal exons, most of which will probably not be
useful. On the other hand, if the value is set much higher than 0.10, then
potentially interesting suboptimal exons may be missed.
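The effect of the cutoff C can be seen by counting how many non-optimal exons pass at different values. The probabilities below are invented for illustration and do not come from any real GENSCAN run; the function name is likewise hypothetical.

```python
# Made-up P(E) values for exons that are not in the optimal parse.
TOY_NON_OPTIMAL = {
    "e1": 0.62, "e2": 0.31, "e3": 0.12, "e4": 0.08,
    "e5": 0.04, "e6": 0.03, "e7": 0.02, "e8": 0.015,
}


def count_suboptimal(exon_probs, cutoff):
    """Number of non-optimal exons with P(E) strictly greater than the cutoff."""
    return sum(1 for p in exon_probs.values() if p > cutoff)


def sweep_cutoffs(exon_probs, cutoffs=(1.00, 0.50, 0.10, 0.01)):
    """Return {cutoff: number of suboptimal exons reported at that cutoff}."""
    return {c: count_suboptimal(exon_probs, c) for c in cutoffs}
```

With these toy values, a cutoff of 1.00 reports nothing, 0.10 reports three exons, and 0.01 reports all eight, which mirrors the trade-off described above.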
Genomic organization of the Capitella sp. I Hox cluster. A total of 11 Capitella sp. I
Hox genes are distributed among three scaffolds. Black lines depict two scaffolds,
which contain 10 of the Capitella sp. I Hox genes. The eleventh gene, CapI-Post1,
is located on a separate scaffold surrounded by ORFs of non-Hox genes
(unpublished data). No predicted ORFs were identified between adjacent linked
Hox genes. Transcription units are shown as boxes denoting exons, connected by
lines that denote introns. Transcription orientation is denoted by arrows beneath
each box. Color coding is the same as that used on the right-hand side for each
ortholog.
The phylogenetic tree on the right-hand side shows that the order of the genes on
the chromosome is retained in several species (genome colinearity).