Pairwise sequence alignment Bioinformatics - lectures Introduction Information networks Protein information resources Genome information resources DNA sequence analysis Pairwise sequence alignment Multiple sequence alignment Secondary database searching Analysis packages Protein structure modelling Pairwise sequence alignment database searching alphabets and complexity algorithms and programs sequences and sub-sequences identity and similarity dotplot local and global similarity pairwise database searching Database searching Database search can take a form of text queries or sequence similarity searches. Text queries are problematic due to missing annotations in many sequences. query sequence = probe searched sequence = subject The purpose of searches is to identify evolutionary relationships (homology) from sequence similarity. Important for search of analogous family members in different species. Alphabets and complexity ■ A sequence consists of letters from an alphabet. ■ The complexity of the alphabet is defined by the number of letters it contains: DNA = 4 proteins = 20 Special letters can be used for ambiguous bases "1 or residues (X). Sequence searching programs must be able to deal with them. Algorithms and programs ■ Algorithm is a set of steps that define a certain computational process. Program algorithm. implementation Same algorithm may be implemented in many programs. Sequences and sub-sequences ■ Alignment of two short sequences: Unaligned Sequence 1 (query) Sequence 2 (subject) score = 6 AGGVLIIQVG I I II I I AGGVLIQVG Aligned Sequence 1 (query) Sequence 2 (subject) score = 9 AGGVLIIQVG I I I I I I III AGGVLI QVG Score increases by the insertion of a gap. The gap increases the number of aligned identical residues. Alignment of a sub-sequence with full sequence A B A B Identity and similarity Introduction gaps solely maximise identities is not biologically meaningful. Scoring penalties are introduced to minimise opening and extension of gaps. Unitary matrix (counting identities) is replaced by similarity matrix (counting similarities) = hiah-scorina matches are replaced bv are replaced biologically meaningful low-scoring matches. Diagnostic power of similarity matrices is higher. (a) A c G T A 1 0 0 0 C 0 1 0 0 G 0 0 1 0 T 0 0 0 1 (b) CSTPAGNDEQHRKMI LVFYWBZX C 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 s 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 T 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 P 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 D 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 E 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Q 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 H 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 R 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 K 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 M 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 F 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Identity and similarity Dayhoff Mutation Data Matrix ** score is based on the concept of Point Accepted Mutation (PAM) ** evolutionary distance 1 PAM = probability of a residue mutating during a distance in which 1 point mutation is accepted per 100 residues ** 250 PAM matrix - similarity score equivalent to 20% matches remaining between two sequences = suitable for identification of similarities in twilight zone ** limitation: derived from alignment of sequences >85% identical Mutation Data Matrix for 250 PAMs C 12 S 0 T -2 P -3 A -2 G -3 2 1 3 10 6 1112 10-115 N -4 D -5 E -5 Q -5 10-100 0 0-101 0 0-100 -1-10 0 -1 2 2 4 1 3 4 12 2 4 H -3 R -4 K -5 -1 -1 0 -1 -1 0-1 0-2 -3 0 0-1-1 -2 2 113 0-1-1 1 10 0 1 6 2 6 0 3 5 H -5 I -Z L -6 V -2 -2 -1 -2 -1 -3 -1 0 -Z -1 -3 -3 -2 -3 -2 -4 -10-10 -1 -2 -3 -2 -1 -2 -2 -2 -2 -3 -4 -3 -2 -2 -2 -2 -2 -2 0 0 -2 -2 -2 -2 -3 -3 -2 -2 -2 6 2 5 4 2 6 2 4 2 4 F -4 Y 0 W -S -3 -3 -5 -4 -5 -3 -3 -5 -3 -5 -2 -5 -6 -6 -7 -4 -6 -5 -5 -2 -4 -4 -4 -4 -7 -7 -5 -2 -4 -5 0 -4 -4 -3 2 -3 0 12-1 -2 -1 -1 -2 -4 -5 -2 -6 9 7 10 0 0 17 C S T P A G N D E Q H R K M I L V F Y W Identity and similarity BLOSUM matrices >► BLOcks Substitution Matrix ** derived from blocks of aligned sequences in BLOCKS database - represents distant relationships implicitly ** bias from identical sequences is removed by clustering *■ BLOSUM62 = matrix derived from sequences clustered at 62% or greater identity Identity and similarity Statistical measures of alignment significance ** performing sequence alignment computationally = creating match according to mathematical model ** adjustable parameters: gap penalties, impact of sequence length, effect of alphabet complexity ** level of confidence to constructed alignment is quantified by statistical parameters: probability (p) - probability that the constructed alignment arose by chance [should approach 0] expected frequency (E) - number of hits one can expect to see by chance [should be <0.001] Example hit list from a database search Sequences producing significant alignments: sp P51698 ILINB PSE PR (LINB) sp Q5Ü642 | YP79 MYCTU (RV25 7 sp P27652 ILUCI RENRE Renill sp Q50600 I YJ33~ MYCTU (RV183 sp Q5Ü67Ü | YM96 MYCTU (RV22 9 sp P22643 IHALO XRNRU (DHLR) sp P34913 IHYES HUMRN (EPHX2 sp 007214 | YR15 MYCTU (RV271 sp Q5Ü599 I YI34~ MYCTU (RV183 sp 031158 | PRXC PSEFL (CPO.. sp P22862 IESTE PSEFL Rryles sp P23106 IXYLF PSEPU (XYLF) sp P29715 IBPR2 STRRU (B P0R2 sp P49323 | PRXC STRLI (CPO.. sp P54549 i YQ jl; BRCSU ÍYQJL) sp P48972 IMYBB MOUSE (MYBL2 sp Q55921 IPRXC SYNY 3 (SLR03 sp Q9JSR6 |PIP NE1MB ' [PIP--) sp 013912 IYDW6 S CH PO (S PRC2 sp Q59695 |RCOC PSEPU (RCOC) sp P46544I PIP LRCDE ' ;pepip) sp P46542I PIP LRCDL ' [PIP--) sp P10244I MYBB HUMRN (MYBL2 sp Q15811I ITSN HUMRN (ITSN. 1,3,4,6-tetrachloro-l,4-cyclohexadie 9..)Hypothetical 33.7 kDa protein Rv a-luciferin 2-monooxygenase (EC 1.13 3C..)Hypothetical 32.2 kDa protein R 6. .)Putative haloalkane dehalogenase Haloalkane dehalogenase (EC 3.8.1.5) )Soluble epoxide hydrolase (SEH) (EC 5..)Hypothetical 36.9 kDa protein Rv 4..)Hypothetical 31.7 kDa protein Rv )Non-heme chloroperoxidase (EC 1.11. terase (EC 3.1.1.2) (Rryl-ester hydr 2-hydroxyrnuconic semi aldehyde hydro 1 )Non-haem bromoperoxidase BP0-R2 (EC )Non-heme chloroperoxidase (EC 1.11. Hypothetical 28.2 kDa protein in GLN ..)Myb-related protein B (B-Myb).[Mu 14)Putative non-heme chloroperoxidas Proline iminopeptidase (EC 3.4.11.5) 3C11.06C)Hypothetical 60.1 kDa prote Dihydrolipoamide acetyltransferase c Proline iminopeptidase (EC 3.4.11.5) Proline iminopeptidase (EC 3.4.11.5) ..)Myb-related protein B (B-Myb).[Ho .)Intersectin (SH3 domain-containing Score E (bits) Value 616 e-176 450 e-126 218 2e-56 102 8e-22 93 7e-19 87 5e-17 49 2e-05 47 5e-05 45 2e-04 45 2e-04 44 6e-04 40 0.008 39 0.011 37 0.054 36 0.093 36 0.12 36 0.16 36 0.16 34 0.47 34 0.62 33 1.1 32 1.4 30 9.2 30 9.2 Dotplot The most basic visual method for comparison of two sequences. Separates noise (random dots) from the signal (adjacent dots). Identical sequences are represented by single central diagonal line, similar sequences by a broken diagonal and dissimilar sequences by random dots. Advanced dotplots utilise similarity matrices for calculation of cell scores. M T F R D L L s v s F E G p R P D S S AG G S S A G G M x - T x —., F X x ■ R X x D X X L X X L x x S x x x x X x V x S x x x x x x F \ X x £ X G X XX XX P x x R x x P x x D x x S x x x x X X S x x x x X x A x x 6 X XX XX 6 x XX XX (a) 133 <3 a. i i 0 0 133 LYC1 PIG (b) 133 < LU Ü 0 / 0 133 LYC1 ANAPL (c) 133 a. i O 0 0 128 LYC1 MACRG Local and global similarity Alignments are mathematical models whose behaviour can be modified through the use of adjustable parameters. The models constructed by dynamic programming algorithms - finding solution of a problem by solving smaller, but similar sub-problems. Global alignment - considers similarity across entire sequence. Local alignment - considers similarity in parts of sequences only. A I M S A O S Alignment AIM-S A-MOS Local and global similarity Global alignment ** Needleman and Wunsch algorithm ** suitable for sequences similar across most of their length (usually closely related) 1. construction of 2D similarity matrix ("dotplot". 2. successive summation of the cells in the matrix starting from N-terminal end —►progressing through the sequence 3. construction of maxi mum-match path through the entire sequence Local and global similarity Local alignment ** Smith-Waterman algorithm ^suitable for distantly related sequences displaying local regions of similarity (functionally-relevant or structurally-relevant) ** each point of the matrix defines the end point of a potential alignment = edge cells of the matrix are initialised to 0 ** possibility for ending the alignment are calculated for every cell ** algorithm is much faster compared to global similarity algorithms (a) Global vs. Global (b) Local vs. Global (c) Local vs. Local Pairwise database searching Extension of the pairwise sequence alignments. Large database searches can not be performed using the original Needleman and Wunsch or Smith-Waterman alaorithms due to time limitations. methods Very fast local-similarity sea re. employing heuristics = FastA and BLAST. These methods concentrates on finding short identical matches. Pairwise database searching FastA algorithm by Lipman and Pearson (1985) identifies short words (k-tuples) common to both sequences k-tuples for proteins: 1-2 residues k-tuples for DNA: up to 6 bases k-tuples lying close to each other on the same diagonal joined by heuristics —* gapped alignments computed by dynamic programming Output from FastA search FASTA searches a protein or DNA sequence data bank -version 3.3t09 Hay 18, 2001 Please cite: H.P. Pearson £ D.J. Lipman PNAS (1988) 85:2444-2448 S : 1- : 2 9 6 aa EHBOSS_001 vs SHISS-PPOT Protein Sequence Database library searching /ebi/services/idata/fastadb/swissprot library 37135523 residues in 101247 sequences statistics extrapolated from 60000 to 101082 sequences Expectation_n fit: rho(ln(x)) = 5.8158+/-0.000184; mu= 4.0375+/- 0.010 nean_var = 74.43 8 6+/-14.72 0, 01 s: 13 2 Z-trini: 20 E-trim: 0 in 0/65 Lanibda= 0.1487 FASTA (3.39 Hay 2001) function [optimized, EL50 matrix (15:-5)] ktup: 2 join: 3 6, opt: 24, gap-pen: -12/ -2, mdth: 16 Scan time: 1.930 The best scores are: opt bits E(101082) SIiJ:LINB_PSEPA P51698 1,3 , 4, 6-TETPACHLOPO-l, 4-CYCL ( 296) 2041 447 2.4e-125 SM:YP79_HYCTU 050642 HYPOTHETICAL 33.7 KDA PPOTEI ( 300) 1494 330 5.le-90 SIiJ:LUCI_PENPE P27652 PENILLA-LUCIFEPIN 2-HONOOXYG ( 311) 744 169 1.4e-41 SH:DHPD_PSESP P19076 2-HYDPOXYHUCONIC SEHIALDEHYD ( 283) 169 46 0.00017 SH: PPXC_PSEFL 031158 NON-HEHE CHLOPOPEPOXIDASE (E ( 273) 168 45 0.00019 SIiJ:PPXC_STPLI P49323 NON-HEHE CHLOPOPEPOXIDASE (E ( 275) 140 39 0.012 SliJ:PIP_EACCO P46541 PPOLINE IHINOPEPTIDASE (EC 3. ( 288) 140 39 0.013 SM:PPXC_SYNY3 055921 PUTATIVE NON-HEHE CHLOPOPEPO ( 276) 125 36 0.11 SliJ:PIP_NEIGO P42786 PPOLINE IHINOPEPTIDASE (EC 3. ( 310) 122 35 0.2 »SliJ:LINE_PSEPA P51698 1, 3 , 4, 6-TETPACHLOPO-l, 4-CYCLOHEXA (296 &&) mitn: 2 041 mitl: 2 041 opt: 2 041 Z-score: 2 372.6 bits: 447.0 E(): 2.4e-12 5 Siiiith-lilateriiian score: 2041; 100.000% identity (100.000% ungapped) in 296 aa overlap (1-296:1-296) 10 20 30 40 50 60 EHEOSS HSLGAKPFGEKKFIEIKGPPHAYIDEGTGDPILFOHGNPTSSYLIiJPNIHPHCAGLGPLIA Slil: L IN HSLGAKPFGEKKFIEIKGPPHAYIDEGTGDPILFCjHGNPTSSYLIiJPNIHPHCAGLGPLIA 10 20 30 40 50 60 Pairwise database searching BLAST Basic Local Alignment Search Tool algorithm by Altschul eta/. (1990) identifies short ungapped sub-sequences (segment pairs) of the same length sub-sequences are extended using dynamic programming to obtain local alignments - high scoring pairs (HSPs) improved algorithm by Altschul eta/. (1997) -produces gapped alignments algorithm very fast - most commonly used for databases searching Output from BLAST search BLASTP 2.0.14 [Jun-29-2000] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffen, Jmghui Zhang, Zheng Zhang, Webb Milieu, and David J. Liprnan (1997), "Gapped BLAST and PSI-BLAST: a neu generation of protein database search programs' Nucleic Acids Res. 2 5:3389-3402 Query= /net/nf sO/vo 11/production/^ nobody/tmp/918495 . 53 5 0-S0758.blastall.a [Unknown form], 297 bases, 818F03BD checksum. f 296 letters) Database: swissprot 101,247 sequences; 37,135,523 total letters Searching..................................................done Sequences producing significant alignments: SliJ:LINB_PSEPA P51698 1,3,4,6-TETPACHLOPO-l, 4-CYCLOHEXADIENE SliJ:YP79_MYCTU 050642 HYPOTHETICAL 33.7 KDA PPOTEIN PV2579. SliJ:LUCI_PENPE P27652 PENILLA-LUCIFEPIN 2-MONOOXYGENASE (EC 1 SliJ:DMPD_PSESP P19076 2-HYDPOXYMUCONIC SEMIALDEHYDE HYDROLASE SliJ:PPXC_PSEFL 031158 NON-HEME CHLOPOPEPOXIDASE (EC 1.11.1.10 SliJ:BPA2_STPAU P29715 NON-HAEM BPOMOPEPOXIDASE BPO-A2 (EC 1.1 SliJ:PIP_BACCO P46541 PROLINE IMINOPEPTIDASE (EC 3.4.11.5) (PI SliJ:PIP NEIMB 09JZP6 PROLINE IMINOPEPTIDASE (EC 3.4.11.5) (PI. Score E (bits) Value 616 e-176 450 e-126 218 2e-56 50 9e-06 45 2e-04 39 0.011 39 0.014 36 0. 16 >SliJ:LINB_PSEPA P51698 1, 3 , 4, 6-TETPACHLOPO-l ,4-CYCLOHEXADIENE HYDROLASE 3.8.1.-) (1,4- TCDN CHLOPOHYDPOLASE). Length =2 96 EC Score = 616 bits (1572), Expect = e-176 Identities = 296/296 (100%), Positives = 296/296 (100% Ouery: 1 MSLGAKPFGEKKF IE IKGPPMAYIDEGTGD PI LFCjHGNPTSSYLIiJPTJIMPHC AGLGPL IA 60 MSLGAKPFGEKKFIEIKGPPMAYIDEGTGDPILFOHGNPTSSYLMRNIMPHCAGLGPLIA Sbjct: 1 MSLGAKPFGEKKF IE IKGPPMAYIDEGTGD PI LFCjHGNPTSSYLIiJPTJIMPHC AGLGPL IA 60