[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2109 2109–2115
BIOINFORMATICS ORIGINAL PAPER
Vol. 26 no. 17 2010, pages 2109–2115
doi:10.1093/bioinformatics/btq358
Sequence analysis Advance Access publication July 11, 2010
An alignment-free method to identify candidate orthologous
enhancers in multiple Drosophila genomes
Manonmani Arunachalam1,2, Karthik Jayasurya2, Pavel Tomancak1,∗ and Uwe Ohler2,∗
1Max Planck Institute for Molecular Cell Biology and Genetics, Dresden, Germany and 2Institute for Genome
Sciences and Policy, Duke University, Durham, NC, USA
Associate Editor: David Posada
ABSTRACT
Motivation: Evolutionarily conserved non-coding genomic
sequences represent a potentially rich source for the discovery
of gene regulatory region such as transcriptional enhancers.
However, detecting orthologous enhancers using alignment-based
methods in higher eukaryotic genomes is particularly challenging,
as regulatory regions can undergo considerable sequence changes
while maintaining their functionality.
Results: We have developed an alignment-free method which
identiﬁes conserved enhancers in multiple diverged species. Our
method is based on similarity metrics between two sequences
based on the co-occurrence of sequence patterns regardless
of their order and orientation, thus tolerating sequence changes
observed in non-coding evolution. We show that our method is
highly successful in detecting orthologous enhancers in distantly
related species without requiring additional information such as
knowledge about transcription factors involved, or predicted binding
sites. By estimating the signiﬁcance of similarity scores, we are
able to discriminate experimentally validated functional enhancers
from seemingly equally conserved candidates without function. We
demonstrate the effectiveness of this approach on a wide range
of enhancers in Drosophila, and also present encouraging results
to detect conserved functional regions across large evolutionary
distances. Our work provides encouraging steps on the way to
ab initio unbiased enhancer prediction to complement ongoing
experimental efforts.
Availability: The software, data and the results used in this article
are available at http://www.genome.duke.edu/labs/ohler/research/
transcription/ﬂy_enhancer/
Contact: tomancak@mpi-cbg.de; uwe.ohler@duke.edu
Supplementary information: Supplementary data are available at
Bioinformatics online.
Received on April 19, 2010; revised on June 25, 2010; accepted on
June 29, 2010
1 INTRODUCTION
The identiﬁcation of functional regulatory elements present in
non-coding DNA sequences is complicated by the fact that such
elements are highly variable and the sequence features are not
sufﬁciently understood. However, functional elements tend to evolve
at slower rates than non-functional regions because they are subject
∗To whom correspondence should be addressed.
to selection. Due to this slower rate of evolution, comparisons among
evolutionarily distant genomes make it possible to use sequence
conservation to identify functional regions in the sea of non-coding
DNA (Blanchette and Tompa, 2002; Wang and Stormo, 2003). The
performance of this ‘phylogenetic footprinting’ strategy depends
on the evolutionary distance between given species and on the
conservation level of individual regulatory regions. A number of
methods have used cross-species alignments to identify regulatory
regions or elements (Cliften et al., 2001; Corcoran et al., 2005;
Hardison, 2000; Kellis et al., 2003; Loots et al., 2002); in addition,
standard tools such as BLAST (Altschul et al., 1990) are regularly
used to identify conserved (segments of) enhancers as well (Erives
and Levine, 2004; Hare et al., 2008). The availability of complete
genomes also allows for applying general approaches to detect
regions under selection in a multiple alignment, such as PhastCons
which is based on a phylogenetic hidden Markov model (Siepel and
Haussler, 2004).
Despite many individually successful applications, conservation
based on sequence alignments can be a poor predictor of functional
enhancers (Cliften et al., 2001). In contrast to protein-coding
sequences, functional non-coding sequences are frequently not
constrained in the ordering and number of functional elements
within them (Ludwig et al., 1998; Markstein and Levine, 2002).
Comparative sequence analysis have identiﬁed signiﬁcant smallscale
insertions, deletions and rearrangements of transcription factor
binding sites (TFBSs) within functional modules (Ludwig et al.,
2000; Ludwig, 2002). Tracking the evolutionary path of such noncoding
elements is proving difﬁcult with current alignment-based
methods.Arecent study on the conservation of the best characterized
eukaryotic enhancers of the even-skipped gene, between Drosophila
and the highly diverged sepsid ﬂies shows that BLAST ﬁnds some
short stretches of sequence similarity but scores too low to conﬁrm
the evolutionary relationship between those species (Hare et al.,
2008). It has also been shown that, especially in distantly related
species, the enhancer sequences are simply not alignable (Wolff
et al., 1999).
The basic problem we address here thus poses itself as follows:
given a regulatory region such as an enhancer without any prior
knowledge of functional sites within, can we devise a way to
identify its location in a (distantly) related species? And, given a
candidate region, can we predict its functionality as an enhancer
based on its conservation? With the availability of enormous sets of
uncharacterized putative regulatory regions (e.g. based on assays
detecting regions of open chromatin), methods tailored to these
problems would be immensely useful for providing complementary
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 2109
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2110 2109–2115
M.Arunachalam et al.
evidence, and for studying the evolution of gene regulation at a
larger scale. We address these two questions based on so-called
alignment-free methods. These methods work under the assumption
that similar sequences will share their word (k-mer) composition
to some extent, comparing two sequences based on co-occurring
words regardless of their order and orientation, thus making no
prior assumptions on the presence of particular TFBSs. Vinga and
Almeida (2003) performed an early systematic review on a number
of different word frequency measures to identify similar regions
between two given sequences. Recent studies have proposed and
applied alignment-free methods to regulatory sequence problems:
Kantorovitz et al. (2007) introduced the ‘D2Z’ score and used it
to identify enhancers ab initio for genes with similarly annotated
function in benchmark Drosophila melanogaster data (Ivan et al.,
2008). Sosinsky et al. (2007) combined motif detection in a set of
enhancers with a score allowing for local permutations of motifs
to identify conserved enhancers. Our work is based on van Helden
et al. (2004) who introduced a Poisson-based metric which relies on
probability theory for comparing sequences on the basis of pattern
counts. In the original study, sets of yeast promoters of co-expressed
genes were used to identify overrepresented words. The metric,
restricted to this subset of words, was then successfully applied to
cluster genes based on promoter similarity.
In summary, previous approaches use alignment-free metrics in
the same genome to either compare and cluster promoters (van
Helden et al., 2004), or to identify enhancers for sets of genes
known to share regulatory patterns (Chan and Kibler, 2005; Ivan
et al., 2008; Nazina and Papatsenko, 2003); some also use sets
of enhancers with similar function to ﬁrst determine a subset of
words on which similarity is calculated (Sosinsky et al., 2007; van
Helden et al., 2004). Different from these approaches, we here apply
alignment-free approach on phylogenetic footprinting problems.
Each candidate enhancer is regarded separately, eliminating the
need of any additional information on known or predicted motifs,
or on similarly regulated genes. We extend the Poisson-based
metric to work with multiple related genomes and show that it
successfully identiﬁes the location of enhancer orthologs in distantly
related species. We also show that appropriate signiﬁcance values
allow for separating functional from non-functional candidates, a
task on which other alignment-based methods fail. We evaluate
our approach on different sets of validated enhancers from the
D.melanogaster genome, a well-studied model system for gene
regulation.
2 METHODS AND DATA
The basic outline of our approach is as follows: given a candidate or known
enhancer sequence in one genome, we scan the corresponding intergenic
region in a second genome with a moving window and calculate a similarity
score between the enhancer and the current window. The maximum in the
resulting similarity proﬁle along the intergenic region is then identiﬁed, and
the signiﬁcance of the maximum score is calculated. Below, we explain the
individual steps in detail.
2.1 Identiﬁcation of orthologous intergenic regions
To locate enhancers in related species, we need to ﬁrst identify the
putative orthologous control region. It is well known that orthologous
regulatory sequences show a signiﬁcantly lower level of similarity than their
corresponding coding sequences. In order to deﬁne the search space, we
therefore make the assumption that enhancers will be present in syntenic
regions ﬂanked by orthologous genes, and use best reciprocal BLAST hits to
identify control regions in a related (non-melanogaster) genome. Repeats and
low-complexity regions in the control sequence were masked using DUST
(Altschul et al., 1990); we observed that the identiﬁed intergenic regions for
the considered enhancer sets typically ranged from 5 to 50 kb.
2.2 Background word probabilities
To score the occurrence of different words, we need to deﬁne their baseline
occurrence frequency in the genomes under consideration. We considered
the whole genome from each species as background sequence to compute
the background frequencies. With frequencies (fi) for k-mers from a given
sequence, the expected number of occurrences mi of the pattern i in species
j in a sequence of length L is obtained by
m
j
i =fiT =fi(L−w+1), (1)
where w the length of the pattern and T the number of possible positions.
We ﬁxed the pattern length as 6 and stored the estimated pattern frequencies
in precalculated tables for efﬁciency reasons.
2.3 Poisson-based metric
Following the successful example of van Helden et al. (2004), we adapted
Poisson-based metrics to compare sequences from two different species.
The metric considers word counts (in our case, all k-mers) together with their
expected probabilities. It relies on the assumption that pattern occurrences are
Poisson distributed and independent of each other. For two given sequences,
a Poisson-based similarity score S is calculated based on the probability of
common occurrences. Sequence divergence is reﬂected in a Poisson-based
dissimilarity score D which reﬂects the difference between word occurrences
in the two sequences. Finally, a mixed metric M is a score that is deﬁned as
weighted combination of the similarity and dissimilarity score.
In our context, we evaluate a given enhancer sequence e from species 1
against a candidate orthologous regulatory control region c from species 2.
The Poisson-based similarity score Sec
i for a single word i from the two
sequences is calculated as follows:
Sec
i =[1−P1(x≥Cec
i )]∗[1−P2(x≥Cec
i )] (2)
where Cec
i is the common count for word i in the sequence e and c:
Cec
i =min(Ne
i ,Nc
i ) (3)
Ne
i ,Nc
i are the number of times the word i appears in e and c and
Pj(x≥Cec
i ) is the probability of observing at least Cec
i in e and c from
species 1 and 2, respectively.
Pj(x≥Cec
i )=
[1−F(Cec
i −1,m
j
i)] if Cec
i >0
1, otherwise
(4)
where the Poisson distribution function F(Cec
i −1,m
j
i) gives the
probability to observe ≤Cec
i occurrences when the expected value is m
j
i. The
multivariate similarity Sec between e and c, over p words, is then obtained by
Sec
=(1/p) Sec
i (5)
The dissimilarity Dec
i is calculated as follows:
Dec
i =|F(Ne
i −1,m1
i )−F(Nc
i −1,m2
i )| (6)
and the multivariate dissimilarity Dec across all p words is thus obtained by
Dec
=
1
p
Dec
i (7)
Finally, the mixed metric Mec is deﬁned as
Mec
=Sec
−aDec
+b, (8)
where Sec is the similarity from Equation (5) and Dec is the dissimilarity
between e and c from Equation (7), a is a positive weighting parameter which
2110
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2111 2109–2115
Conserved enhancer detection using an alignment-free method
can be tuned to give more emphasis on the common or distinct occurrences
between two sequences, and b is an offset which ensures that the metric is
always positive. When two sequences have exactly the same counts for all
the patterns, their dissimilarity is 0. This score increases with the number
of distinct counts. For any two sequences, if the pattern is found in both
sequences, then the mixed metric is positive, which implies that the given
sequences are similar to each other. The mixed metric contains a negative
contribution for the dissimilar sequences.
2.4 Obtaining a background score distribution
To assess the signiﬁcance of a mixed metric score, we computed an empirical
background distribution by sampling random intergenic sequences from
the D.melanogaster and a related genome. Starting from D.melanogaster
sequences >10 kb in length, we identiﬁed the orthologous sequences in
the distantly related species (D.ananassae, D.pseudoobscura, D.willistoni,
D.mojavensis, D.virilis and D.grimshawi). We ﬁrst picked a random
orthologous pair, and then sampled windows of equal length at random
locations in each region, and computed the mixed metric between the
windows [using Equation (8)]. The ﬁrst and last 500 bp were excluded to
remove effects of potential proximal and core promoter sequences which
show generally elevated conservation levels. Repeating this procedure
10 000 times established a background mixed metric score distribution of
two arbitrary sequence windows from orthologous intergenic regions. We
used this distribution to determine score threshold values corresponding
to speciﬁc signiﬁcance P-values (typically, P ≤ 0.05 or 0.01). This step is
important to distinguish background sequence conservation, which can be
high for compact genomes such as D.melanogaster, from regions which show
signiﬁcantly high scores based on the Poisson distance metric.
2.5 Scanning a known enhancer against a control
region
To locate an enhancer in an orthologous genome, a known enhancer is
scanned against the regulatory control region from a related species in a
sliding window fashion. As we do not know the size of the orthologous
enhancer, we select a ﬁxed window size in the orthologous region and
shift the window by 50-bp steps. Using the background word frequencies
from the two species, the mixed metric score is computed for each window
using Equation (8). The result of this scanning process is a proﬁle along
the region of the related species. We applied the same procedure on the
reverse complement sequence of the control region. The window with global
maximum mixed metric score on either strand is considered as candidate
orthologous enhancer region in the related species. As the true enhancer
length may exceed the chosen window size, we merge windows consecutive
to the window with the best score if their similarity score exceeds the given
threshold value deﬁned by the background score distribution. Orthologous
control regions differ in length and typically exceed the window size, and
we correct for multiple hypothesis testing in the P-value calculation to take
the sequence length into account. A Bonferroni-style corrected P-value is
obtained by
Corrected P-value = [ P-value * (Sequence length / window size) ] (9)
where P-value is the signiﬁcance threshold value obtained from the
background, and the sequence length is the size of the control orthologous
sequence. Note that we approximate the correction factor by the number
of non-overlapping windows, as the scores of neighboring overlapping
windows will be highly correlated and do not constitute independent tests.
2.6 Combining scores from multiple pairwise scans
For each known enhancer region in D.melanogaster, we can identify
similar sequences in multiple diverged species by comparing each one
individually to D.melanogaster in a pairwise fashion. To gain conﬁdence
in predicting functionality based on good similarity scores, we combine
these pairwise scores. In order to compare multiple species, we need to
take the evolutionary distance and relationship into account. For this, we
considered available evolutionary distance information and normalized all
branch lengths used in a particular combination of species to sum up to 1.
We then compute a phylogenetic-tree-reweighting score (PRS) based on the
normalized path length between the two compared species (the phylogenetic
tree and examples for reweighting are shown in the Supplementary Material).
Pairwise PRSs are used as weights to compute a combined P-value across
all species j as
Weighted P-value =
j
(PRS * Corrected P-value) (10)
This weighted P-value thus indicates the signiﬁcance of the predicted
orthologous enhancer region across multiple non-melanogaster genomes. We
limited our analyses to six species outside the melanogaster subgroup, as the
branch lengths within the subgroup are comparably short and would not
contribute much to a combined score.
2.7 Datasets
To evaluate our approach, we used several datasets from Drosophila and
non-Drosophila insect species, with varying degrees of information about
functionality of enhancers outside D.melanogaster.
(a) Aset of 37 regions from D.melanogaster predicted to act as enhancers
for anterior/posterior (A/P) embryonic patterning were compiled from
Berman et al. (2002). These regions exhibited unusually high densities
of predicted binding sites for the early-active transcription factors
Bicoid, Hunchback, Kruppel, Knirps and Caudal. In a follow-up study,
the same group carried out an evolutionary analysis of these enhancers
based on comparisons of the D.melanogaster and D.pseudoobscura
genomes (Berman et al., 2004). All 37 candidates were conserved
based on the alignment analysis in D.pseudoobscura, and 33 of them
were experimentally tested; yet, only 15 enhancers drove expression
along A/P axis whereas the other 18 candidates did not. This data is
therefore an ideal dataset to evaluate our approach regarding its ability
to predict functionality of enhancer candidates; those experimentally
validated enhancer regions that drive expression patterns form a
positive set, and the enhancers that do not drive expression are
considered as negative set.
(b) As a large-scale dataset, we used a well-annotated collection of known
cis-regulatory modules (CRMs) from the Drosophila genome that
are deposited in the REDﬂy database (Gallo et al., 2006). Release 5
contained 728 CRMs recorded along with information on the gene
expression pattern driven by each CRM. We considered 421 CRMs
with a sequence length between 250 and 2000 bp for our analysis
(this excludes short modules for which our approach mostly is not
applicable). This dataset contains regulatory sequences from nearly
all stages of embryogenesis and some post-embryonic stages. For
instance, it includes CRMs that drive ectoderm expression (CNS,
PNS, SNS, trachea and epidermis), or enhancers that drives part
ectodermal (foregut, hindgut, salivary glands and Malpighian tubules)
and endodermal (midgut) expression, and it contains enhancers
for mesoderm and its major derivatives (visceral musculature, fat
body, dorsal vessel and somatic musculature). There is no curated
information about enhancer locations outside of D.melanogaster, and
is thus used to evaluate whether our approach makes signiﬁcant
predictions across a wider range of expression patterns exceeding
the well-annotated segmentation network enhancers above.
(c) Finally, in a recent study on the conservation of well-characterized
eve enhancers between Drosophila and Sepsid genomes, it was shown
that enhancers from distantly related genomes (>100 Mio years) can
still produce identical expression patterns despite almost completely
rearranged binding sites (Hare et al., 2008). Six sepsid species
from three families which have well-characterized phylogenies were
2111
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2112 2109–2115
M.Arunachalam et al.
considered for this analysis. Predicted orthologous enhancers in sepsid
genomes were based on alignment methods such as BLASTZ and
contained a small number of short (20–30 bp) sequences conserved
between sepsids and Drosophilids. The expression patterns of eve
genes in sepsid genomes were experimentally validated using in situ
hybridization and conﬁrmed the characteristic expression pattern of
seven transverse stripes. We used this dataset to test whether our
method is able to detect orthologous enhancers in sepsid genomes
which is signiﬁcantly more diverged from D.melanogaster than any
Drosophila species, and for which standard alignment methods break
down due to the signiﬁcant divergence of linear sequence similarity.
With sufﬁciently large set of known enhancers, the parameters (i) word
length, (ii) window size and (iii) mixed metric weighting can be automatically
trained to achieve optimal results. In our case, the datasets with known
orthologous ﬂy enhancers are relatively small, and the larger sets generally
lack information about orthologs. For all experiments, we therefore ﬁxed the
word length as 6 and the window size as 500, reﬂecting commonly made
choices in the literature, and set the weighting parameter a = 0.1 to reﬂect
our stronger interest in shared rather than differing word occurrences.
3 RESULTS
3.1 The alignment-free method correctly predicts
enhancers that drive A/P pattern during Drosophila
embryogenesis
To test its effectiveness, we ﬁrst applied our method on a
set of 33 annotated enhancer candidates for A/P patterning
during Drosophila embryogenesis. According to alignment-based
methods, all candidates were conserved in D.pseudoobscura; yet,
experiments showed that only 15 of these candidates were indeed
enhancers driving an A/P expression pattern, while 18 were
non-functional. Our task was therefore ﬁrst to identify the best
location of these candidates in the related genomes, and then to
computationally distinguish the real candidates from false ones
using our alignment-free method. We ﬁrst identiﬁed orthologous
intergenic regions (cf. Section 2.1) from non-melanogaster genomes
(D.ananassae, D.pseudoobscura, D.willisoni and D.mojavensis,
D.virilis, D.grimshawi), in which the orthologous enhancers were
expected to be located. We scanned a known enhancer from
D.melanogaster against a moving window of ﬁxed size (500 bp)
along the corresponding orthologous intergenic region (with the
mixed metric weight a set to 0.1; cf. Section 2.5). Since
functional enhancers can be located on the forward or the reverse
strand, we computed the mixed metric along both strands of the
entire orthologous intergenic region. The window with the global
maximum mixed metric score provided us with the information
of the location of orthologous enhancer regions. As an example,
Figure 1 shows the result of scanning the well-known enhancer
Eve_stripe_3/7 against the orthologous intergenic region from
D.pseudoobscura. It is evident that there is one clear maximum,
at the location of the annotated enhancer in D.pseudoobscura, and
that this maximum exceeds the signiﬁcance threshold.
After all pair-wise comparisons between the reference species
D.melanogaster and the 6 non-melanogaster genomes, we computed
a weighted P-value from the corrected P-values from the pairwise
scans, taking the evolutionary tree into account (Section 2.6). Our
method was successful in separating functional from non-functional
enhancer candidates: 14 of the 15 positive candidates received
weighted P-values <0.05, with the remaining receiving a borderline
Fig. 1. Scanning the Eve_stripe_3/7 enhancer against the orthologous
intergenic region from D.pseudoobscura. The X-axis indicates the window
position, and the corresponding mixed metric score at each position is
shown. Forward and reverse strand scanning results are shown with solid and
dotted lines, respectively. The horizontal line represents an adjusted P-value
threshold of 0.05 as obtained from the background distribution (Section 2.4).
The orthologous enhancer region in D.pseudoobscura based on Berman’s
analysis is indicated as a bar.
value just above the threshold (Fig. 2a). In turn, 16 out of 18
negative candidates clearly exceeded signiﬁcant P-values (all >0.1).
For all positive candidates, the location of enhancers predicted by
our approach overlapped with the locations as annotated in Berman
et al. (2004).
The weighted P-values of most of the negative candidates
indicates that those are in fact not conserved based on the Poisson
similarity metric, which implies those candidates are non-functional
regions. However, there were two candidates (CE8021 and CE8023)
in which the orthologous regions were identiﬁed with signiﬁcant
weighted P-values in all six non-melanogaster species (cf. Fig. 2a).
These candidates are as well conserved in distantly related species
as the functional enhancers, and given the clear separation we
observe for the remaining negative set, may therefore be functional
in Drosophila embryo development, despite the fact that they could
previously not be experimentally validated (Berman et al., 2004).As
a ﬁrst step toward validation, at least one of the genes ﬂanking the
candidate should be expressed in an A/P-like pattern similar to other
validated enhancers. The candidate CE8023 is located in between
the ‘Dfd’and ‘Ama’genes. TheAtlas of Patterns of Gene Expression
during Drosophila Embryogenesis (APOGEE; Tomancak et al.,
2007) has in situ hybridization image data on Dfd, and its expression
indeed shows patterning along the A/P axis (Fig. 3). While this does
not conﬁrm that the precise region we predicted is the functional one,
it indicates that the Dfd genomic locus must contain an enhancer such
as the one we predicted. The other candidate, CE8021, is located 7 kb
upstream of the ‘reaper’ gene. Unfortunately, the expression pattern
of reaper is not yet available in APOGEE.
3.2 Performance evaluation compared with
alignment-based approaches
To put the performance of our alignment-free method in context,
we compared it to general purpose alignment-based methods that
equally do not require knowledge of binding site information. We
performed the sequence comparison for all 33 predicted enhancers
2112
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2113 2109–2115
Conserved enhancer detection using an alignment-free method
Fig. 2. Performance of different methods in detecting enhancers based on conservation across multiple species. (a) Alignment-free approach; (b) BLAST;
(c) phastCons. The positive candidates are indicated in black bars and negatives are represented in gray.
Fig. 3. Genomic region containing the putative enhancer CE8023 and the
Dfd gene on chromosome 3R [image extracted from Flybase (Ashburner and
Drysdale, 1994)]. Box in the right side and middle indicate the location of
Dfd and CE8023. An image showing the expression pattern of Dfd taken
from the in situ database is shown.
from D.melanogaster using BLAST (combined pairwise scores
weighted as in our approach) or the average phastCons value from
its track in the UCSC genome browser (Siepel and Haussler, 2004);
the results are shown in Figure 2b and c, respectively. Compared
with Figure 2a, it is clear that our alignment-free method performed
signiﬁcantly better than BLAST and phastCons, which largely failed
to discriminate the positive and the negative enhancers. It therefore
appears that the alignment-free metrics reﬂect the particular patterns
of enhancers (shared word occurrences, but possibly in different
order and/or orientation) more adequately than approaches to detect
conserved regions in general, and which assume linearity over longer
stretches of sequence.
3.3 Evaluation on additional datasets
To demonstrate that the applicability extended beyond the wellstudied
A/P patterning, we considered two additional datasets. We
ﬁrst evaluated a well-studied and experimentally validated set of
conserved enhancers that control dorsal/ventral (D/V) patterning of
Drosophila embryogenesis (Papatsenko and Levine, 2005). Results
were very similar to the A/P set (see Supplementary Material for
details), and our method successfully identiﬁed the location of
nearly all enhancers in divergent genomes, including secondary or
‘shadow enhancers’ (Hong et al., 2008). Both A/P and D/V datasets
concerned regulatory regions which deﬁne expression patterns in
the early Drosophila blastoderm. To evaluate our approach on a
larger set of regions with a broader range of expression patterns,
we examined enhancers from a large collection of experimentally
veriﬁed enhancers in D.melanogaster taken from the REDﬂy
database. In total, 421 enhancers were considered; for some of the
Fig. 4. Result of orthologous enhancer identiﬁcation of REDﬂy candidates
in pairwise scans. The successful predictions are shown in gray (P<0.05),
borderline candidates (0.05 ≤ P < 0.1) are indicated in black, dashed vertical
line indicates regions without signiﬁcant predictions and white shows
candidates for which we could not assign orthologous ﬂanking genes.
enhancers we could not identify orthologs of the ﬂanking genes in
a related species, and thus no syntenic intergenic region to analyze.
Using the alignment-free method, we ﬁrst performed a
systematic pairwise comparison between the reference genome
(D.melanogaster) and the other non-melanogaster species (Fig. 4).
For instance, for 73% of candidates we were able to identify
a candidate orthologous enhancer with signiﬁcant P-value in
D.ananassae; 4% had borderline signiﬁcance values between 0.05
and 0.1; for 12% of enhancers we were not able to identify the
orthologous ﬂanking genes of the corresponding enhancers; and we
did not make any signiﬁcant prediction for 11%. Overall, we reached
the highest fraction of signiﬁcant predictions in D.pseudoobscura
(76%), and in more distantly related species ( D.mojavensis,
D.virilis and D.grimshawi) we made predictions for at least 60%
of enhancers.
While the fraction of non-orthologous ﬂanking genes scaled
with the phylogenetic distance, the fraction of enhancers without
signiﬁcant prediction showed interesting trends; more distantly
related genomes had smaller fractions of enhancers without
predictions, indicating that the smaller fraction of identiﬁed syntenic
intergenic regions may also be highly reliable. Not ﬁtting into
this overall pattern is D.willistoni, which shows a remarkably
higher fraction of enhancers for which we could not predict
orthologs. Combining all six species, we could make predictions
with signiﬁcant weighted P-value for 58% of enhancers, which
suggests that our alignment-free method performs well in detecting
2113
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2114 2109–2115
M.Arunachalam et al.
Table 1. Evaluation on the REDﬂy database
Annotation term REDﬂy Analyzed Predicted
mapping1.blastoderm 77 44 36 (82%)
mapping1.cns 34 17 14 (82%)
mapping1.ectoderm 37 20 15 (75%)
mappling1.imaginal disc 47 28 21 (75%)
mapping1.pns 24 12 10 (83%)
mapping2.ectoderm 51 27 19 (70%)
mapping2.eye 18 10 5 (50%)
mapping2.imaginal disc 12 10 9 (90%)
mapping2.mesoderm 45 16 7 (44%)
mapping2.neuronal 54 26 21 (81%)
mapping2.wing 33 21 17 (81%)
mapping3.larva 69 34 27 (79%)
The ﬁrst column shows the gene expression pattern, the second the number of CRMs
in REDﬂy. The third column indicates the number of sequences that have identiﬁable
orthologous search regions; the fourth shows the number of predictions with signiﬁcant
similarity (out of the sequences in Column 3). We show results for patterns with at least
10 analyzed CRMs.
candidate orthologous enhancers with a wide range of expression
patterns. In a recent study on CRM prediction in Drosophila based
on REDﬂy (Ivan et al., 2008), experimentally validated CRMs were
grouped based on common gene expression annotation at different
levels of abstraction, such as blastoderm (which contains both the
A/P and the D/V datasets discussed above), mesoderm and neuronal.
Breaking down our results by these different groups, Table 1 shows
the fraction of enhancers for which we obtained signiﬁcant weighted
P-values. Signiﬁcant predictions are made across a wide variety
of patterns, with rates ranging from about 50% to >80%, which
indicates that the Poisson metric is applicable to not just a particular
group of enhancers. However, as information on the location of these
regions is not available in other genomes, we cannot validate these
predictions as we could above for the better studied smaller sets.
Finally, we evaluate a recent dataset of eve enhancers mapped
in genomes of scavenger ﬂies (sepsids) that are signiﬁcantly more
diverged from D.melanogaster than the other ﬂy genomes used
in our analyses so far. Although these enhancers share only short
stretches of block sequence similarity with D.melanogaster, it has
been experimentally shown that the sequences can recapitulate
expression patterns similar to Drosophila—i.e. despite strong
sequence divergence, the function appears to be conserved (Hare
et al., 2008). We applied the scanning method on six sepsid genomes
for which the orthologous intergenic regions, and the locations of
the enhancers in them, are available. The genome sequences of
sepsid species are currently not available and so we used background
scores computed from D.grimshawi as approximation, which is more
distantly related to the reference species.
Interestingly, for the enhancers eve_stripe_3+7 and
eve_stripe_4+6, our method predicted the candidate orthologous
enhancers in all six sepsid genomes with signiﬁcant P-value and
at the annotated locations (Table 2). This provides strong evidence
that our method can identify conserved orthologous regulatory
enhancers in genomes highly diverged from the reference genome,
where alignment methods fail to detect any similarity. However,
we failed to identify the orthologous enhancers for eve_stripe_2
and eve_MHE, with a sequence size in D.melanogaster of 392 and
318 bp, respectively; yet as the analysis above showed, we can
Table 2. Evolutionary conservation analysis on eve enhancers in Sepsid
genomes
Sepsid genomes Eve_stripe_3+7 Eve_stripe_4+6
S.punctum 0.009 0.0079
S.cynipsea 0.0086 0.007
D.sp. 0.0035 0.0094
T.superba 0.009 0.0107
T.minor 0.0073 0.0064
T.putris 0.006 0.0097
Similarity P-values of enhancers Eve_stripe_3+7 and Eve_stripe_4+6 are shown.
identify the eve_stripe_2 in all Drosophila genomes (results are
shown in the Supplementary Material). Sepsid intergenic regions,
and the annotated enhancers in them, are larger compared with fruit
ﬂies, and the failure may be related to sequence rearrangements,
which exceed our window size of 500 bp.
4 DISCUSSION
We presented an alignment-free method based on a Poisson metric
to identify orthologous enhancers in distantly related species for a
given known enhancer, without any knowledge of speciﬁc binding
sites in them. Based on earlier metrics applied to enriched words in
single species, we extended the metric to pairs of related genomes,
and developed methods for signiﬁcance estimation and combination
of more than two genomes. A characteristic feature of enhancers
is that they convey their function regardless of orientation relative
to the gene they regulate. As described, we therefore calculate the
similarity across both strands of an intergenic region. Alternatively,
we also evaluated collapsing counts for words and their reverse
complements, and we did not observe a signiﬁcant difference. We
validated the enhancer detection analysis on a set of enhancer
candidates initially predicted by clustering of known TFBSs that
act at very early stages of Drosophila development to deﬁne the A/P
axis of the embryo. We were able to discriminate between positive
candidates from negative candidates without enhancer function, and
provided evidence for functionality for one of two cases which had
previously not been validated.
Applying our approach to different Drosophila datasets
demonstrated the ﬂexibility of our alignment-free method. In cases
in which we failed, the enhancers were shorter than 250 bp,
suggesting that our alignment-free approach relies on a certain size
to obtain discriminative evidence. We were furthermore able to make
predictions for at least 58% of enhancers in all six species on a large
and diverse enhancer set with functions beyond just early Drosophila
development. Finally, the method was at least partially successful in
identifying enhancers of the eve gene in non-Drosophila species
(sepsids), supporting the notion that our approach can identify
CRMs in alignment-free manner across a wide range of sequence
divergence. A recent, concurrently developed approach was mostly
applied to enhancers within the same species, but interestingly also
detected the CRMs for eve_stripe_3+7 and eve_stripe_4+6 and, like
us, failed to detect CRMs for eve_stripe_2 and eve_MHE (Leung
and Eisen, 2009).
Different from some other non-alignment methods, our approach
identiﬁes conserved CRMs without prior knowledge of known
TFBSs or enriched words, and all k-mers are equally involved in
2114
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom
[15:13 30/7/2010 Bioinformatics-btq358.tex] Page: 2115 2109–2115
Conserved enhancer detection using an alignment-free method
scoring a candidate region. However, to reduce noise, it is easily
possible to restrict the scoring to a subset of k-mers with known or
predicted regulatory roles, such as TFBS consensus strings. This
may enhance the detection success in cases where binding site
density is relatively scarce, or the sequences considered relatively
short. It may also help to increase the window size and deal with
cases where rearrangements happen over large distances. There are
several directions in which our approach can be extended. We plan to
next study its application for de novo enhancer discovery, i.e. when
the CRM information is totally unknown, by scoring all possible
window pairs of two intergenic regions against each other. This
would provide an unbiased computational approach for genomewide
regulatory region detection and would offer a complementary
approach to ongoing efforts, such as the modENCODE project. This
approach would be particularly interesting when analyzing more
distantly related species, where orthologous enhancers are difﬁcult
to identify using available alignment-based methods.
5 CONCLUSION
We show that candidate orthologous enhancers in multiple
Drosophila and non-Drosophila genomes can be identiﬁed by an
unbiased alignment-free method which uses no information of
predicted or known functional motifs. Our results show promise
in the effort to characterize non-coding sequence conservation on
a higher functional rather than the nucleotide level. Since our
method showed encouraging results in a wide range of Drosophila
enhancers with different functions and active at different life stages,
it will be interesting to explore its application for the discovery of
regulatory modules in other organisms with larger and more complex
non-coding sequence spaces.
ACKNOWLEDGEMENTS
We thank David Corcoran for helpful comments, and Aaron Wise
for evaluation of third-party software.
Funding: National Institutes of Health (R01 HG004065 to U.O.); the
Human Frontier Science Program (RGY0084/2008 to P.T., U.O.).
Conﬂict of Interest: none declared.
REFERENCES
Altschul,S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Ashburner,M. and Drysdale,R. (1994) FlyBase - the Drosophila genetic database.
Development, 120, 2077–2079.
Berman,B.P. et al. (2002) Exploiting transcription factor binding site clustering to
identify cis-regulatory modules involved in pattern formation in the Drosophila
genome. Proc. Natl Acad. Sci. USA, 99, 757–762.
Berman,B.P. et al. (2004) Computational identiﬁcation of developmental enhancers:
conservation and function of transcription factor binding-site clusters in Drosophila
melanogaster and Drosophila pseudoobscura. Genome Biol., 5, R61.
Blanchette,M. and Tompa,M. (2002) Discovery of regulatory elements by a
computational method for phylogenetic footprinting. Genome Res., 12, 739–749.
Chan,B.Y. and Kibler,D. (2005) Using hexamers to predict cis-regulatory motifs in
Drosophila. BMC Bioinformatics, 6, 262.
Cliften,P.F. et al. (2001) Surveying Saccharomyces genomes to identify functional
elements by comparative DNA sequence analysis. Genome Res., 11, 1175–1186.
Corcoran,D.L. et al. (2005) Footer: a quantitative comparative genomics method for
efﬁcient recognition of cis-regulatory elements. Genome Res., 15, 840–847.
Erives,A. and Levine,M. (2004) Coordinate enhancers share common organizational
features in the Drosophila genome. Proc. Natl Acad. Sci. USA, 101, 3851–3856.
Gallo,S.M. et al. (2006) REDﬂy: a Regulatory Element Database for Drosophila.
Bioinformatics, 22, 381–383.
Hardison,R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory
elements. Trends Genet., 16, 369–372.
Hare,E.E. et al. (2008) Sepsid even-skipped enhancers are functionally conserved in
Drosophila despite lack of sequence conservation. PLoS Genet., 4, e1000106.
Hong,J.W. et al. (2008) Shadow enhancers as a source of evolutionary novelty. Science,
321, 1314.
Ivan,A. et al. (2008) Computational discovery of cis-regulatory modules in Drosophila
without prior knowledge of motifs. Genome Biol., 9, R22.
Kantorovitz,M.R. et al. (2007) A statistical method for alignment-free comparison of
regulatory sequences. Bioinformatics, 23, i249–i255.
Kellis,M. et al. (2003) Sequencing and comparison of yeast species to identify genes
and regulatory elements. Nature, 423, 241–254.
Leung,G. and Eisen,M.B. (2009) Identifying cis-regulatory sequences by word proﬁle
similarity. PLoS ONE, 4, e6901.
Loots,G.G. et al. (2002) rVista for comparative sequence-based discovery of functional
transcription factor binding sites. Genome Res., 12, 832–839.
Ludwig,M.Z. et al. (1998) Functional analysis of eve strip 2 enhancer evolution in
Drosophila: rules governing conservation and change. Development, 125, 949–958.
Ludwig,M.Z. et al. (2002) Evidence for stabilizing selection in a eukaryotic enhancer
element. Nature, 403, 564–567.
Ludwig,M.Z. (2002) Functional evolution of noncoding DNA. Curr. Opin. Genet. Dev.,
12, 634–639.
Markstein,M. and Levine,M. (2002) Decoding cis-regulatory DNAs in the Drosophila
genome. Curr. Opin. Genet. Dev., 12, 601–606.
Nazina,A.G. and Papatsenko,D.A. (2003) Statistical extraction of Drosophila
cis-regulatory modules using exhaustive assessment of local word frequency. BMC
Bioinformatics, 4, 65.
Papatsenko,D. and Levine,M. (2005) Quantitative analysis of binding motifs mediating
diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. Proc. Natl
Acad. Sci. USA, 102, 4966–5971.
Park,P.J. et al. (2002) Comparing gene expression proﬁles in genes with similar
promoter regions. Bioinformatics, 18, 1576–1584.
Siepel,A. and Haussler,D. (2004) Combining phylogenetic and hidden markov models
in biosequence analysis. J. Comput. Biol., 11, 413–428.
Sosinsky,A. et al. (2007) Discovering transcriptional regulatory regions in Drosophila
by a nonalignment method for phylogenetic footprinting. Proc. Natl Acad. Sci. USA,
104, 6305–6310.
Tomancak,P. et al. (2007) Global analysis of patterns of gene expression during
Drosophila embryogenesis. Genome Biol., 8, R145.
van Helden,J. (2004) Metrics for comparing regulatory sequences on the basis of pattern
counts. Bioinformatics, 20, 399–406.
Vinga,S. and Almeida,J. (2003) Alignment-free sequence comparison-a review.
Bioinformatics, 19, 513–523.
Wang,T. and Stormo,G.D. (2003) Combining phylogenetic data with co-regulated genes
to identify regulatory motifs. Bioinformatics, 19, 2369–2380
Wolff,C. et al. (1999) Structure and evolution of a pair-rule interaction element: runt
regulatory sequences in D. melanogaster and D. virilis. Mech. Dev., 80, 87–99.
2115
atMasarykUniversityonFebruary21,2011bioinformatics.oxfordjournals.orgDownloadedfrom