CG020 Genomika
Lesson 2: Genes Identification

Jan Hejatko
Functional Genomics and Proteomics of Plants, Mendel Centre for Plant Genomics and Proteomics, CEITEC - Central European Institute of Technology and National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.eu

Literature
■ Literature sources for Chapter 02:
■ Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey
■ Majoros, W.H., Pertea, M., Antonescu, C. and Salzberg, S.L. (2003) GlimmerM, Exonomy, and Unveil: three ab initio eukaryotic genefinders. Nucleic Acids Research, 31(13).
■ Singh, G. and Lykke-Andersen, J. (2003) New insights into the formation of active nonsense-mediated decay complexes. TRENDS in Biochemical Sciences, 28 (464).
■ Wang, L. and Wessler, S.R. (1998) Inefficient reinitiation is responsible for upstream open reading frame-mediated translational repression of the maize R gene. Plant Cell, 10 (1733).
■ de Souza et al. (1998) Toward a resolution of the introns early/late debate: Only phase zero introns are correlated with the structure of ancient proteins. PNAS, 95 (5094).
■ Feuillet and Keller (2002) Comparative genomics in the grass family: molecular characterization of grass genome structure and evolution. Ann Bot, 89 (3-10).
■ Frobius, A.C., Matus, D.Q., and Seaver, E.C. (2008). Genomic organization and expression demonstrate spatial and temporal Hox gene colinearity in the lophotrochozoan Capitella sp. I.
PLoS One 3, e4004

Outline
Forward and Reverse Genetics Approaches
■ Differences between the approaches used for identification of genes and their function
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries
■ Forward and reverse genetics

Outline
Forward and Reverse Genetics Approaches
■ Differences between the approaches used for identification of genes and their function

Forward vs. Reverse Genetics
Revolution in understanding the term "gene"
■ "classical" genetics approaches
■ "reverse genetics" approaches
[Figure: sequence 5'-TTATATATATATATTAAAAAATAAAATAA]

Identification of the role of ARR21 gene
• Hypothetical signal transducer in the two-component system of Arabidopsis

Recent Model of the CK Signaling via Multistep Phosphorelay (MSP) Pathway
[Figure: AHK sensor histidine kinases (AHK2, AHK3, CRE1/AHK4/WOL) at the plasma membrane (PM); HPt proteins (AHP1-6); response regulators (ARR1-24) in the nucleus; outputs: regulation of transcription, interaction with effector proteins]

Identification of the role of ARR21 gene
• Mutant identified by searching databases of insertional mutants (SINS: sequenced insertion site) using BLAST

Identification of the role of ARR21 gene - isolation of insertional mutant
Searching in databases of insertional mutants (SINS)

InsertSINS: 010964
Query: 80    tcctagcgttcatgagcgtaccatacttgacaanagagaacgtagccagccatttacagg 139
             ||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 58319 tcctagcgttcatgagcgtaccatacttgacaagagagaacgtagccagccatttacagg 58378
ARR21: 1830

InsertSINS: 010964
Query: 140   tttgatatctcttgtcaaaaatgtttttggattttactgt 179
             ||||||||||||||||||||||||||||||||||||||||
Sbjct: 58379 tttgatatctcttgtcaaaaatgtttttggattttactgt 58418
ARR21: 1890

Localization of the dSpm insertion in the genomic sequence of ARR21 by sequencing of PCR products
[Figure: map of ARR21 with labels ATG, D2, D1, K, W, primers 16k-d11 and 16k-16p, and fragments of 1727 bp and 1728 bp]

Identification of the role of ARR21 gene
• Expression of ARR21 in the wild type and inhibition of ARR21 expression in the insertional mutant confirmed at the RNA level

Identification of the role of ARR21 gene - analysis of expression
[Figure: panels "wild type expression" and "insertional mutant vs wild type"]

Identification of the role of ARR21 gene
• Phenotype analysis of the insertional mutant

Identification of the role of ARR21 gene - phenotype analysis of mutant
Analysis of sensitivity to plant growth regulators
■ 2,4-D and kinetin
■ ethylene
■ Light of various wavelengths
No alterations, neither in flowering nor in the number of seeds
[Figure: response to combinations of 2,4-D and kinetin; kinetin axis ticks 3, 10, 30, 100, 300, 1000]

Identification of the role of ARR21 gene - homology of ARR genes
[Figure: tree of ARR genes. Legend: □ ARR-A, ■ ARR-B, ○ at least one EST found]

Identification of the role of ARR21 gene - possible reasons for the absence of the phenotype
• Functional redundancy within the gene family?
• Phenotype only under specific conditions

Identification of the role of ARR21 gene - summary
■ The ARR21 gene was identified by comparative analysis of the Arabidopsis genome
■ Its function was predicted from sequence analysis
■ Site-specific expression of ARR21 was confirmed at the RNA level
■ Identifying the function of ARR21 in Arabidopsis development by insertional mutagenesis was not successful, probably because of functional redundancy within the gene family

Outline
Identification of Genes Ab Initio
■ Structure of genes and searching for them

Genes Structure
[Figure: promoter, 5'UTR, coding region starting with ATG, 3'UTR; example sequences ATG....ATTCAT, ATTATCTGATATA..., .ATAAATAAATGCGA]

RNA Splicing
[Figure: 5' exon, 5' splice site, intron, 3' splice site, 3' exon; conserved regions marked]

Identification of Genes Ab Initio
Omitting 5' and 3' UTR
■ Identification of the translation start (ATG) and stop codon (TAG, TAA, TGA)
■ Finding donor (typically GT) and acceptor (AG) splice sites
■ Using various statistical models (e.g. Hidden Markov Model - HMM, see recommended literature, Majoros et al., 2003) to evaluate and score the weight of the identified donor and acceptor sites

Splicing Site Prediction
Programs for splice site prediction (specificity approximately 35 %)
□ GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/gene_spl.html)
□ SplicePredictor (http://deepc2.psi.iastate.edu/cgi-bin/sp.cgi)

SplicePredictor
SplicePredictor - a method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models. Sequences should be in the one-letter code ({a,b,c,g,h,k,m,n,r,s,t,u,w,y}), upper or lower case; all other characters are ignored during input.
Multiple sequence input is accepted in FASTA format (sequences separated by identifier lines of the form ">SQ;name_of_sequence comments") or in GenBank format.
[Screenshot: SplicePredictor input form with the genomic DNA sequence pasted in]

SplicePredictor
SplicePredictor. Version of February 13, 2005. Date run: Wed Nov 9 11:30:14 2005
Species: Homo sapiens
Model: 2-class Bayesian
Prediction cutoff (2 ln[BF]): 3.00
Local pruning: on
Non-canonical sites: not scored
Sequence: your-sequence, from 1 to 9490
[Screenshot: list of potential splice sites with location, sequence and scores, e.g. acceptor tttcctctctcacAGga at 1373 and donor aagGTagta at 2546, shown over the annotated sequence map with the uORF and exon 4 marked]
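The first step of the ab initio search described earlier (locating candidate GT donors and AG acceptors) can be sketched in a few lines of Python. This is only an illustrative enumeration, not the method used by SplicePredictor or NetGene2: the function names, the toy sequence and the `min_len` threshold are hypothetical, and no statistical scoring (e.g. an HMM) is applied, so most hits would be false positives in practice.

```python
import re


def find_motif_positions(seq, motif):
    """Return the 1-based start positions of all (overlapping) motif hits."""
    pattern = f"(?={re.escape(motif.upper())})"
    return [m.start() + 1 for m in re.finditer(pattern, seq.upper())]


def candidate_splice_sites(seq):
    """List candidate donor (GT) and acceptor (AG) dinucleotides."""
    return {
        "donor": find_motif_positions(seq, "GT"),
        "acceptor": find_motif_positions(seq, "AG"),
    }


def candidate_introns(seq, min_len=10):
    """Pair each GT with every AG and keep spans of at least min_len bases.

    Returns (start, end) tuples, 1-based and inclusive: start is the G of
    the GT donor, end is the G of the AG acceptor.
    """
    sites = candidate_splice_sites(seq)
    introns = []
    for donor in sites["donor"]:
        for acceptor in sites["acceptor"]:
            end = acceptor + 1
            if end - donor + 1 >= min_len:
                introns.append((donor, end))
    return introns


if __name__ == "__main__":
    toy = "ATGGCCAAGGTAAGTTTTTCTTTAGGCTTGA"  # made-up sequence, not from CKI1
    print(candidate_splice_sites(toy))
    print(candidate_introns(toy))
```

Even on this 31-base toy sequence the scan reports two candidate donors and three candidate acceptors, which illustrates why the prediction programs rank sites with statistical models instead of listing every GT and AG.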
Splicing Site Prediction
Programs for splice site prediction (specificity approximately 35 %)
□ NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/)

NetGene2
NetGene2 Server: the NetGene2 server is a service producing neural network predictions of splice sites in human, C. elegans and A. thaliana DNA.
[Screenshot: NetGene2 submission form with the genomic DNA sequence pasted in]

NetGene2 v. 2.4
Sequence composition: length 9490 nucleotides; 31.8% A, 17.0% C, 19.6% G, 31.7% T, 0.0% X, 36.5% G+C
[Screenshot: NetGene2 output shown over the annotated sequence map. Donor splice sites on the direct strand at positions 1704, 1906, 3582, 3765, 4134, 4619, 5356, 5384, 5809, 6057, 6096, 7369, 7886 and 9323 (one further position illegible); acceptor splice sites on the direct strand at positions 1213, 1221, 1373, 1487, 4832, 5004, 5472, 6135, 6490, 6744, 7447, 7780 and 7786 (one further position illegible)]

RNA Splicing and Adaptation
Flexibility in splice-site recognition in plants in practice: an example of the developmental plasticity of (not only) plants
• Identification of a mutant with a point mutation (G→A transition) exactly at the splice site at
the 5' end of the 4th exon
• Analysis by RT-PCR proved the presence of a fragment shorter than the cDNA should be after the typical splicing event
[Figure: RT-PCR gels for two PDR primer pairs (e.g. PDR_U1b/PDR_L1b), lanes wt and pis1, size markers 100-500 bp]
• Sequencing of this fragment then suggested alternative splicing at the closest possible splice site in exon 4
• The existence of similar defense mechanisms has been shown in other organisms as well (e.g.
instability of mutant mRNA with a premature stop codon located more than 50-55 nt upstream of the last exon-exon junction in eukaryotes; see recommended literature, Singh and Lykke-Andersen, 2003)
[Figure: active NMD complex at a UAG stop codon]

Identification of Genes Ab Initio
■ Programs for exon prediction
□ 4 types of exons (according to location in the gene): initial, internal, terminal, single
□ The programs predict splice sites and also take into account the structure of the given exon type
• initial:
□ GENSCAN (http://hollywood.mit.edu/GENSCAN.html)
□ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
• internal:
□ MZEF (http://rulai.cshl.org/tools/genefinder/)

GENSCAN
The New GENSCAN Web Server at MIT
Identification of complete gene structures in genomic DNA
This server provides access to the program Genscan for predicting the locations and exon-intron structures of genes in genomic sequences from a variety of organisms. It can accept sequences up to 1 million base pairs (1 Mbp) in length.
[Screenshot: GENSCAN submission form (organism, suboptimal exon cutoff, sequence name, print options) with the genomic DNA sequence pasted in]

GENSCAN
GENSCANW output for sequence CKI1
GENSCAN 1.0   Date run: 10-Nov-105   Time: 02:24:26
Sequence CKI1 : 9490 bp : 36.53% C+G : Isochore 1 (0 - 43 C+G%)
Parameter matrix: Arabidopsis.smat

Predicted genes/exons:
Gn.Ex  Type  S  Begin  End   Len   Fr  Ph  CodRg  P
1.00   Prom  +  1497   1536  40
1.01   Init  +  3708   3764  57    2   0   37     0.499
1.02   Intr  +  3894   4133  240   2   0   327    0.713
1.03   Intr  +  4255   4914  660   0   0   296    0.771
1.04   Intr  +  5005   5383  379   0   1   343    0.772
1.05   Intr  +  5473   6056  584   2   2   582    0.722
1.06   Intr  +  6136   7368  1233  0   0   655    0.977
1.07   Term  +  7448   7660  213   1   0   212    0.999
1.08   PlyA  +  7910   7915  6
2.03   PlyA  -  7976   7971  6
2.02   Term  -  8793   8050  744   0   0   542    0.997
2.01   Init  -  9253   8936  318   1   0   386    0.999

Suboptimal exons with probability > 0.100
Exnum  Type  S  Begin  End   Len   Fr  Ph  CodRg  P
S.001  Init  +  1867   1905  39    0   0   57     0.298
S.002  Init  +  2374   2442  69    0   0   -11    0.132
S.003  Intr  +  3894   4110  217   2   1   307    0.177
S.004  Intr  +  4352   4914  563   0   2   338    0.187
S.005  Intr  +  5005   5379  375   0   0   335    0.212
S.006  Intr  +  5442   6056  615   2   0   589    0.208

GENSCAN
[Figure: GENSCAN predicted genes in sequence, scale in kb. Key to optimal exons: initial exon, internal exon, terminal exon, single-exon gene]

Regulation of Translation
• Splicing in untranslated regions - an important regulatory part of genes
• Translational repression by short ORFs in the 5' UTR
□ Identified e.g. in maize (Wang and Wessler, 1998; see recommended literature for additional information)
□ In the case of CKI1, there was an attempt to demonstrate this mechanism of regulation using transgenic lines carrying uidA under the control of two versions of the promoter (unconfirmed so far)
[Figure: short upstream ORF ATGaaaagagcttttTAG (M K R A F) preceding the start of the main ORF ATGatggtgaaagttaca... (M M V K V T...)]
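A short upstream ORF such as the one shown above (ATGaaaagagcttttTAG, encoding MKRAF) can be located by scanning the 5' leader for ATG codons followed by an in-frame stop codon. The following is a minimal sketch in Python, assuming the standard genetic code. The function names and the CCCCCC spacer between the uORF and the main ATG are made up for illustration and do not reproduce the actual CKI1 leader.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA sequence from its first base; unknown codons give X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_complete_orfs(seq):
    """Return (start, end, peptide) for every ATG with an in-frame stop.

    Coordinates are 1-based and inclusive; end is the last base of the stop.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        peptide = []
        for pos in range(start, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[pos:pos + 3], "X")
            if aa == "*":
                orfs.append((start + 1, pos + 3, "".join(peptide)))
                break
            peptide.append(aa)
    return orfs


def upstream_orfs(seq, main_start):
    """Keep only complete ORFs that end before the main start codon."""
    return [orf for orf in find_complete_orfs(seq) if orf[1] < main_start]


if __name__ == "__main__":
    # uORF and main ORF start from the slide; the spacer is arbitrary.
    leader = "ATGaaaagagcttttTAG" + "CCCCCC" + "ATGatggtgaaagttaca"
    print(upstream_orfs(leader, main_start=25))
    print(translate("ATGatggtgaaagttaca"))
```

The sketch reports only ORFs that are closed by a stop codon inside the analysed fragment, so the truncated main ORF is not listed as a uORF.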
Regulation of Translation
• Functional purpose of splicing in untranslated regions - an important regulatory part of genes
[Figure: sequences of the two promoter versions around the BamHI site, with the encoded amino acids; position -2739, intron and exon marked]
GAGGAGGCACAAAATGACGAA TGTATTCTTTTGTTATCAAAGGGTTTCGACTTTGCTCCGAGGAAGAAGATAATATGAGGATCCCCCGGGTAGGTCAGTCCCTTATGTTACGTCCTGTAGAAACCCCAACC
(M)RIPRVGQSLMLRPVETPT
GAGGAGGCACAAAATGACGAA -//- GTTATACAAGTTCACTCAAATGATGGTGAAAGTTACAAAGCTTGTGGCTTCACGTCGGATCCCCCGGGTAGGTCAGTCCCTTATGTTACGTCCTGTAGAAACCCCAACC
MMVKVTKLVASRRIPRVGQSLMLRPVETPT

Gene Modelling
Programs for gene modelling
□ Those that take into account other parameters as well, e.g. continuity of ORFs
□ GENSCAN (http://hollywood.mit.edu/GENSCAN.html) - very good for prediction of exons in coding regions (tested on the gene PDR9: GENSCAN identified all 23 (!) exons)
□ GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/)
□ GlimmerHMM (https://ccb.jhu.edu/software/glimmerhmm/)

GeneMark
GeneMark: a family of gene prediction programs provided by Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia.
[Screenshot: GeneMark home page with sections on gene prediction in bacteria and archaea, eukaryotes, ESTs and cDNAs, and viruses]

Eukaryotic GeneMark.hmm
References:
1 Borodovsky M. and Lukashin A. (unpublished)
2 Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M., "Gene identification in novel eukaryotic genomes by self-training algorithm", Nucleic Acids Research, 2005, Vol. 33, No. 20, 6494-6506
UPDATE October 2005: added pre-built models of eukaryotic GeneMark.hmm ES-3.0 (E - eukaryotic; S - self-training; 3.0 - the version)
[Screenshot: GeneMark.hmm submission form with the CKI1 sequence pasted in; species model: A. thaliana ES-3.0]

GeneMark
Eukaryotic GeneMark.hmm version bp 3.9 April 25, 2008
Sequence name: CKI1
Sequence length: 5043 bp
G+C content: 38.73%
Matrices file: /home/genmark/euk_ghm.matrices/athaliana_hmm3.0mod
Thu Oct 1 11:09:24 2009

Predicted genes/exons
Gene  Exon  Strand  Exon type  Exon range   Exon length
1     1     +       Initial    969   1025   57
1     2     +       Internal   1155  1394   240
1     3     +       Internal   1516  2175   660
1     4     +       Internal   2266  2644   379
1     5     +       Internal   2734  3317   584
1     6     +       Internal   3397  4629   1233
1     7     +       Terminal   4709  4921   213

[Figure: GeneMark.hmm prediction plot (order 5, window 96, step 12) along the nucleotide position axis]
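The GENSCAN and GeneMark.hmm listings were produced from inputs of different length (9490 bp and 5043 bp), so their coordinates cannot be compared directly. The sketch below, in Python, checks whether the two programs nevertheless predict the same exons. The coordinates are transcribed from the two listings (garbled digits were restored so that end - start + 1 matches the printed exon length), the function names are hypothetical, and the idea that the shorter input is a sub-fragment of the longer one is an assumption, not something stated in the source.

```python
# Exon coordinates (1-based, inclusive) of the plus-strand gene, transcribed
# from the GENSCAN (9490 bp input) and GeneMark.hmm (5043 bp input) listings.
GENSCAN_EXONS = [
    (3708, 3764), (3894, 4133), (4255, 4914), (5005, 5383),
    (5473, 6056), (6136, 7368), (7448, 7660),
]
GENEMARK_EXONS = [
    (969, 1025), (1155, 1394), (1516, 2175), (2266, 2644),
    (2734, 3317), (3397, 4629), (4709, 4921),
]


def exon_lengths(exons):
    """Length of each exon from inclusive start/end coordinates."""
    return [end - start + 1 for start, end in exons]


def coordinate_shifts(exons_a, exons_b):
    """Set of differences between matching exon boundaries in two listings.

    A single value means both predictions agree up to a constant offset.
    """
    shifts = set()
    for (start_a, end_a), (start_b, end_b) in zip(exons_a, exons_b):
        shifts.add(start_a - start_b)
        shifts.add(end_a - end_b)
    return shifts


if __name__ == "__main__":
    print(exon_lengths(GENSCAN_EXONS) == exon_lengths(GENEMARK_EXONS))
    print(coordinate_shifts(GENSCAN_EXONS, GENEMARK_EXONS))
```

With these values all seven exons have the same length in both predictions and every boundary differs by the same 2739 bp, so under the stated assumption the two programs agree on the complete exon-intron structure of this gene.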
Genomic Homologies
■ Searching for genes according to homologies with known sequences
■ Comparison with EST databases
□ BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
■ Comparison with protein databases
□ BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/, http://workbench.sdsc.edu/)
□ Genewise (http://www.ebi.ac.uk/Wise2/)
These tools compare a protein sequence with genomic DNA (after conceptual translation of the DNA), so a known amino acid sequence is required
■ Comparison with homologous genome sequences from related species
□ VISTA/AVID (http://www.lbl.gov/Tech-Transfer/techs/lbnl1690.html)

Outline
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology

Genomic Colinearity
■ Genomes of related species, despite large differences, share similarities in sequence organization. This information can be used to identify genes in related species when searching databases
■ General scheme of work when applying genomic colinearity (also called "comparative genomics") for experimental identification of genes in related species:
□ Mapping small genomes using low-copy DNA markers (e.g. RFLP)
□ Using these markers to identify orthologous genes (genes derived from a common ancestral gene, usually with the same or similar function) in related species
□ A small genome (e.g. rice, 466 Mbp) can be used as a guide: low-copy molecular markers (e.g. RFLP) linked to the gene of interest are identified, and these regions are then used as probes for screening BAC libraries to identify the orthologous regions of large genomes (e.g. barley: 5 Gbp, or wheat: 16 Gbp)

Genomic Colinearity
[Figure: colinear genomic regions of rice (400 Mbp), maize (2500 Mbp), barley (5000 Mbp) and hexaploid wheat (16 000 Mbp); scale bars 20 kb and 40 kb; rice is labelled as having high gene density. Feuillet and Keller, 2002]

Genomic Colinearity
■ Can mostly be used for grass species (e.g. using related genes of barley, wheat, rice, maize)
■ Small genome rearrangements (deletions, duplications, inversions, translocations smaller than a few cM) are then detected by detailed comparative sequence analysis
■ During evolution, related species have diverged, mostly in non-coding regions (invasion of retrotransposons etc.)

Methylation Filtration
■ Preparation of gene-enriched libraries by methylation filtration
■ Scheme of work during preparation of BAC genome libraries using methylation filtration:
□ preparation of genomic DNA free of organelle DNA (chloroplasts and mitochondria)
□ fragmentation of DNA (1-4 kbp) and ligation of adaptors
□ preparation of BAC libraries in an mcrBC+ strain of E. coli (McrBC cleaves methylated DNA, so clones carrying heavily methylated, mostly repetitive sequences are not recovered)
□ selection of positive clones
■ Limited usage: enrichment of coding DNA only approx. 5-10 %

Outline
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries

EST Libraries
Preparation of EST libraries
□ Isolation of mRNA
□ Reverse transcription
□ Ligation of linkers and synthesis of the second cDNA strand
□ Cloning into a suitable bacterial vector
□ Transformation into bacteria and isolation of DNA (amplification of DNA)
□ Sequencing using primers specific for the plasmid used
□ Saving the sequencing results in a public database
[Figure labels: AAAAAAAAAA (poly(A) tail), TTTTTTTTTT (oligo(dT) primer), ggatgctaatatgggggttatacaagtgtt (example sequence read)]
Fundamentals of Genomics II, Gene Identification

Outline
Forward and Reverse Genetics Approaches
■ Differences between the approaches used for identification of genes and their function
Identification of Genes Ab Initio
■ Structure of genes and searching for them
■ Genomic colinearity and genomic homology
Experimental Genes Identification
■ Constructing gene-enriched libraries using methylation filtration technology
■ EST libraries
■ Forward and reverse genetics

Discussion
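
Worked example for the discussion: the exon coordinates in the GeneMark.hmm listing for CKI1 can be cross-checked with simple arithmetic. The following is a minimal Python sketch, not part of GeneMark itself. The function names are hypothetical, and two assumptions are made: the partly illegible values in the slide (start of exon 3 and start of exon 7) are taken as 1516 and 4709 so that length = end - start + 1 holds, and the Start/End Frame column is read as the codon position (1 to 3) of the first and last nucleotide of each exon.

```python
# Exon ranges (1-based, inclusive) as read from the GeneMark.hmm listing for CKI1.
# Assumption: 1516 and 4709 are inferred from the reported exon lengths,
# because those digits are not fully legible in the slide.
EXONS = [
    (969, 1025),
    (1155, 1394),
    (1516, 2175),
    (2266, 2644),
    (2734, 3317),
    (3397, 4629),
    (4709, 4921),
]


def exon_lengths(exons):
    """Length of each exon in nucleotides (inclusive coordinates)."""
    return [end - start + 1 for start, end in exons]


def intron_lengths(exons):
    """Length of each intron lying between two consecutive exons."""
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(exons, exons[1:])]


def exon_frames(exons):
    """Codon position (1..3) of the first and last nucleotide of each exon.

    Assumes the coding sequence starts at the first nucleotide of exon 1.
    """
    frames = []
    consumed = 0  # coding nucleotides before the current exon
    for length in exon_lengths(exons):
        start_frame = consumed % 3 + 1
        consumed += length
        end_frame = (consumed - 1) % 3 + 1
        frames.append((start_frame, end_frame))
    return frames


lengths = exon_lengths(EXONS)
cds_length = sum(lengths)

print("exon lengths:  ", lengths)
print("intron lengths:", intron_lengths(EXONS))
print("frames:        ", exon_frames(EXONS))
print("CDS length:    ", cds_length, "nt =", cds_length // 3, "codons")
```

The summed exon length is a multiple of three, as expected for a complete open reading frame, and the computed frames agree with the legible rows of the listing (exon 4 ends in codon position 1, so exon 5 starts in position 2).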