[10:28 28/8/2010 Bioinformatics-btq364.tex] Page: i414 i414–i419 BIOINFORMATICS Vol. 26 ECCB 2010, pages i414–i419 doi:10.1093/bioinformatics/btq364 A fast algorithm for exact sequence search in biological sequences using polyphase decomposition Abhilash Srikantha1,∗, Ajit S. Bopardikar1,∗, Kalyan Kumar Kaipa1, Parthasarathy Venkataraman1, Kyusang Lee2, TaeJin Ahn2 and Rangavittal Narayanan1 1Samsung Advanced Institute of Technology India Lab, Bangalore, Karnataka, India and 2Genetic Analysis Group, Emerging Center, Samsung Advanced Institute of Technology, Suwon, South Korea ABSTRACT Motivation: Exact sequence search allows a user to search for a specific DNA subsequence in a larger DNA sequence or database. It serves as a vital block in many areas such as Pharmacogenetics, Phylogenetics and Personal Genomics. As sequencing of genomic data becomes increasingly affordable, the amount of sequence data that must be processed will also increase exponentially. In this context, fast sequence search algorithms will play an important role in exploiting the information contained in the newly sequenced data. Many existing algorithms do not scale up well for large sequences or databases because of their high-computational costs. This article describes an efficient algorithm for performing fast searches on large DNA sequences. It makes use of hash tables of Q-grams that are constructed after downsampling the database, to enable efficient search and memory use. Time complexity for pattern search is reduced using beam pruning techniques. Theoretical complexity calculations and performance figures are presented to indicate the potential of the proposed algorithm. Contact: s.abhilash@samsung.com; ajit.b@samsung.com 1 INTRODUCTION Decreasing costs of sequencing personal genome have opened up many avenues of research. Several efforts related to Personal Healthcare and Pharmacogenetics are attempting to use the information in an individual’s genomic data towards personalization of healthcare. An important component of these solutions is the search for specific subsequences in a given genome. For example, Cetuximab (Eric et al., 2009)—an Epidermal Growth Factor Receptor (EFGR) inhibitor used to treat various types of cancer is ineffective if certain mutations in the KRAS gene (which lies in Exon 2 of Chromosome 12) exist. Thus, a search for appropriate mutations conducted on genetic data would be vital in prescribing the most effective treatment. Exact sequence search also finds application in fields such as Evolutionary Biology and Phylogenetics where certain subsequences of DNA are mined from genomic data of various species of organisms to understand their origin, relatedness and descent. In the above context, the challenge is to be able to perform fast pattern searches in whole genomes (a complete human genome contains 3 billion base pairs) and databases spanning Giga to Tera Bytes or more. Prevalent search methods (Altschul et al., 1990; Charras and Lecroq, 2004; Gusfield 1997; Kurtz et al., 2004; Li and Durbin, 2010; Lipman and Pearson, 1985; Ma et al., 2002; Ning et al., 2001) use techniques that have proven to be efficient for ∗To whom correspondence should be addressed. existing genomic sequences and databases, but do not scale up well for large datasets such as those of whole human genomes. Fast and efficient search methods that scale up well for large databases are therefore of great value. In this article, we address the problem of searching for all occurrences of pattern Pin a text T, where T is the reference sequence whose length can vary from several million (in case of individual chromosomes) to several billion (in case of complete genomes) bases. P is the pattern which is a few tens to a few hundred bases in length. 2 RELATED WORK Many methods have been developed for the task of pattern search. These include FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988), BLAST (Altschul et al., 1990), PatternHunter (Ma et al., 2002), MUMMER (Kurtz et al., 2004), SSAHA (Sequence Search andAlignment using HashingAlgorithm) (Ning et al., 2001), Fast String Matching Algorithms (Lecroq, 2007) and BWA-SW (Burrows Wheeler Alignment—Smith Waterman) (Li and Durbin, 2010). Though Smith Waterman-based methods such as FASTA mines approximate matches by employing dynamic programming techniques, they are computationally very intensive. BLAST and its variants are an improvement over FASTA in that they use certain seeds for basic anchoring, which are then extended to exact or approximate matches. However, apart from being probabilistic in nature, BLAST type algorithms require large amounts of memory and computing time for pattern search in large sequences such as whole genomes. PatternHunter (Ma et al., 2002) is a similar seedbased technique but is still inefficient for applications that involve whole genomes or large databases. Recent suffix tree-based methods (Gusfield, 1997) such as Mummer (Kurtz et al., 2004) that yield exact matches have a very low-search time complexity. They represent all suffixes of the text as a plurality of inter-mingled linked lists.At times when the knowledge about genomes gets updated frequently, updating the suffix tree in place becomes tedious as the inter-mingling of linked lists is very sensitive to changes in the text data. Also, as every node in the tree is required to hold tree-related information such as pointers to their parents and children apart from text-based information, even the best implementation of suffix trees require ∼15.4 bytes per base (Kurtz et al., 2004), which scales up to 46 GB of memory to store the preprocessed Human Genome. Deterministic Finite Automaton (DFA)-based methods (Charras and Lecroq, 2004; Gusfield 1997) such as BWA-SW (Li and Durbin, 2010) combine DFA and dynamic-programming-based © The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. [10:28 28/8/2010 Bioinformatics-btq364.tex] Page: i415 i414–i419 A fast algorithm for exact sequence search in biological sequences using polyphase decomposition alignment methods. As this method involves Finite Automaton, the method does not scale well for large sequences, even for the best case of exact matches. Also, because they use dynamic programming, the memory requirements of the method are huge. Hashing-based methods such as SSAHA (Ning et al., 2001) and those proposed by Lecroq (2007) propose substring matching and Q-gram hashing method to greatly improve the time complexity. To summarize, efficient biological pattern-search algorithms must mitigate two problems. First, that of random access into text, without which the time complexity of the algorithm shoots up to an unacceptable O(LT ) (Charras and Lecroq, 2004; Knuth et al., 1977; Melichar, 1995), where T is of the order of several billion bases. This can be mitigated by employing mechanisms such as suffix trees and hash tables. Hashing methods are considered because changing data locally is an easy task when information in the corresponding sequence gets updated. The second problem is that of memory constraints. Random access algorithms (Kurtz et al., 2004; Lecroq, 2007; Ning et al., 2001) come with an undesirable space complexity of O(LT ), where LT is the length of the text T is of the order of several billion bases. In our work, we propose an efficient methodology that employs down-sampled version of T and polyphase decomposition of P to determine potential areas of exact match. These, in conjunction with hash tables and the use of Q grams to successively refine potentialsearch regions results in superior space and time complexity. Note that the present method differs from the existing methods (Lecroq, 2007; Ning et al., 2001) in that, firstly, we consider down-sampled substrings of both text and the pattern that result in large memory savings. Secondly, the Q-grams have traditionally been used only for the purposes of string matching, but in the proposed method, their locations have been used to progressively localize the potential locations of exact match. We now describe the proposed algorithm. 3 PROPOSED METHOD We first explain various terminologies we use in the exposition that follows. For this purpose, we use an example sequence, given by: S=‘actgcttctact’. (1) Let the length of the sequence S be denoted by LS. Also, S[n] is used to represent the n-th base of the sequence S. For example, S[0] = ‘a’ and S[1] = ‘c’. Downsampling (Vaidyanathan, 1993) a sequence by a factor of M means that we pick every M-th base from sequence S to form a new sequence SM given by: SM[n]=S[Mn],0≤n≤ LS M (2) where, indicates the largest integer less than the argument. For example, for the sequence S and M =3, S3 = ‘agta’ M-channel polyphase decomposition (Vaidyanathan, 1993) gives M possible down-sampled sequences for different integer-phase shifts. The generalized form of polyphase decomposition is given by: SMi[n]=S[Mn+i],0≤n≤ LS M ,0≤i≤M −1. (3) Note that SM = SM0. The polyphase decomposition of the sequence S in Equation (1) for M =3 yields: S30 = ‘agta’, S31 = ‘cccc’ and S32 = ‘tttt’ A Q-gram of a sequence S is denoted by QS(n) and is made up of Q consecutive bases starting from position n. Thus for sequence S as in Equation (1) example Q-grams are: QS(0) = ‘actg’ QS(1) = ‘ctgc’ QS(2) = ‘tgct’ QS(8) = ‘tact’ Contiguous Q-grams of a sequence S is the set: CQS = QS 0 QS 1 ···QS LS −Q (4) where the cardinality of the set CQS is |CQS|=LS −Q+1. Note that there are Q – 1 common bases between any two consecutive Q-grams in CQS. For example, given S as in Equation (1) and Q=4, CQS = {‘actg’, ‘ctgc’, ‘tgct’, ‘gctt’, ‘cttc’, ‘ttct’, ‘tcta’, ‘ctac’, ‘tact’}. NQS is the set of non-overlapping Q-grams of sequence S given by: NQS = QS 0 QS Q ...QS LS Q . (5) Note that: ∀ QS i and QS j ∈NQS, QS i ∩QS j =φ if i =j where φ denotes the null set. That is the Q-grams in NQS are pairwise and non-overlapping. For S as in Equation (1) and Q=4, NQS = {‘actg’, ‘cttc’, ‘tact’}. We now describe the proposed method which is composed of two stages, namely: (1) Preprocessing stage. (2) Pattern-search stage. The detailed description of each stage is presented in the following subsections. 3.1 Preprocessing stage A block diagram for the preprocessing stage is shown in Figure 1. In this step, text T is downsampled and processed into a hash table to support random access into the text. A hash table is a data structure that efficiently links keys to corresponding values called buckets (Cormen et al., 1990). In our case, the key refers to each distinct Q-gram while bucket refers to the list of locations of that Q-gram in the text. Here, T is first downsampled by a factor of M to give TM and a hash table HTMQ is then constructed using contiguous Q-grams of TM. The bucket of a given Q-gram q is denoted by HTMQ[q]. The details of the preprocessing algorithm are given below. This is followed by an example on sample text. Algorithm–A (1) Downsampled the text T by a factor of M to yield TM. (2) Generate the set of contiguous Q-grams for TM, namely, CQTM. (3) For each Q-Gram in CQTM, consult the key field of the hash table HTMQ. i415 [10:28 28/8/2010 Bioinformatics-btq364.tex] Page: i416 i414–i419 A.Srikantha et al. Fig. 1. Block diagram of the preprocessing stage. Table 1. A sample Q-gram hash table Key: Q(=3) gram Bucket: Locations’ list agt 0, 3, 6 gta 1, 4, 7 tag 2, 5 taa 8 aac 9 aca 10 (4) If the Q-Gram exists in the key field, append the new position in the bucket HTMQ[Q-gram]. (5) Else, add a new Q-gram key and its position in the corresponding bucket in HTMQ. Example: Consider the sequence T = ‘acc gat tag aag ggt tta aga gtc tca acc aga cta agc’. For M =3 and Q=3, the result of the algorithm is given below. (1) T3 = agtagtagtaaca. (2) CQT3:{agt(0), gta(1), tag(2), agt(3), gta(4), tag(5), agt(6), gta(7), taa(8), aac(9), aca(10)}. Note that we have also mentioned the indices for each Q-gram in CQT3 in parentheses. (3) Finally the HTMQ is as given in Table 1. 3.2 Pattern-search stage The idea behind the pattern-search algorithm is that given T and P, if P occurs in T at unknown locations, it is necessary that at least one downsampled polyphase PMi of P occurs in TM. Note that the reverse need not be true, that is, a certain PMi occurring in TM does not guarantee that P occurs in T. Therefore, we first mine for all occurrences of PMi in TM and search around the resulting indices for an exact match in T. Figure 2 presents the block diagram of the procedure. The definitions of crucial variables are given below: (1) PPAll: this is the array that holds the locations of exact matches for all polyphases of P. The elements of PPAll denoted by PPAll(n). (2) PP: this is the array that holds the locations of exact matches of a particular polyphase of P. The elements of PP are denoted by PP(n). The algorithm breaks each polyphase into non-overlapping Q-grams and bases its search on a successive refinement principle by employing beam pruning technique. That is, matches to the first non-overlapping Q-gram are first found. If these matches extend to the next Q-gram, then these locations are retained. This process is carried out for all the Q-grams in the given polyphase. At the end of this process, those locations where all the Q-grams match represent the locations where the polyphase PMi matches TM. These locations are then mapped to the original text T where the final search takes place. In this manner, the algorithm successively refines the search regions and thus speeds the search process. The algorithm is presented below followed by an example on a sample pattern. Algorithm—B (1) Generate PMi 0≤i≤M −1 from P using Polyphase Decomposition. (2) Initialize PPAll = {φ} (an empty array). (3) For each polyphase PMi, do (4) and (8). (4) Generate NQPMi (set of non-overlapping Q-grams of polyphase PMi). (5) PP= HTMQ[NQPMi(0)] (this represents all positions in TM where an exact match is found for the first non overlapping Q-gram of PMi). (6) For all NQPMi(j)∈NQPMi, j>0 (the rest of the entries in NQPMi) do (7). (7) Delete from PP, PP(k) such that PP(k) + jQ /∈HTMQ[NQPMi(j)] NQPMi(j) should have occurred at location PP(k) + jQ if there was an exact match. But, if that is not the case, there is no chance of an exact match of polyphase PMi in TM at location PP(k). Hence, it must be pruned from the array PP. (8) Append all PP (k)∈ PP into PPAll. That is, make note of all locations of exact match of PMi in TM in the array PPAll that holds the locations of exact match for all PMis in TM. (9) Translate PPAll(n) to corresponding location in original text T and verify exact match of P in T in that location. Note that Steps (4) and (8) deal with extracting exact locations of PMi in TM. Step (7) is a beam-pruning method to prune out non-exact-match locations of PMi from the PP. Example: Consider P= ‘aag ggt tta aga gtc tca’. Also M, Q and T as are the same as those used in the previous illustration. We show the steps of the algorithm for one of the polyphases: PM0. The results of the other two polyphases are presented in Step (8). i416 [10:28 28/8/2010 Bioinformatics-btq364.tex] Page: i417 i414–i419 A fast algorithm for exact sequence search in biological sequences using polyphase decomposition Fig. 2. Block diagram of the pattern-search stage. (1) PM0 = ‘agtagt’ PM1 = ‘agtgtc’ PM2 = ‘gtaaca’. (2) PPAll = {φ} (an empty array) (3) Iterating for PM0 for Steps (4) and (8). (4) NQPM0 = {agt(0), agt(1)} (note that we have mentioned the indices of the Q-gram in NQPM0 in parenthesis). (5) PP= HTNQ[NQPM0(0)] = [0,3,6]. (6) Iterating over {agt(1)} for Step (7). (7) Prune only PP(2). Because PP(2)+1×Q=6+3=9 /∈ HTNQ[NQPM0(1)]. (8) PPAll = {0,3}, (a) result of PM1: PP= {φ}; (b) result of PM2: PP= {7}. Therefore, PPAll = {0,3,7}. (9) Translating the locations of PPAll to those in T gives {0,9,19}. Note that we have considered the actual polyphase the exact match has come from. For example, as PPAll[2] = 7 come from the second polyphase, the mapping of this location to a location in T would be PPAll[2] ×M −2=19. Exact matches are now verified starting from these locations in T. In this example, the exact match occurs at index 9 in T. Note that if the length of PMi is not an integral multiple of Q, then certain trailing nucleotides will be missed by the NQPMi. One of the solutions in order to avoid this loss is to verify exact match of PMi in TM for all PP(n) just before Step (8). 4 ALGORITHM ANALYSIS We now present the space and time complexity analysis of the proposed method. 4.1 Space complexity of hash table From Algorithm A, because every overlapping Q-gram in TM contributes to an entry in the hash table, the algorithm’s space complexity is O(LTM) where the length of TM, LTM =LT /M due to downsampling. Also, because constructing the hash table is a sequential process, its time complexity is O(LTM). Note that there are two types of memory in a computer—main memory, which is fast, costly and scarce and the secondary memory, which is slow, cheap and abundant. An efficient design of an algorithm is an optimal balance between its speed and its main memory requirements. In our design, we have retained a smaller, down-sampled version of the text T in the main memory, which we consider for space complexity calculations. For the purposes of exact match verification in Step (9) of Algorithm B, the original text T can reside in the cheaper secondary memory, and relevant sections (with length <