Genome information resources

Bioinformatics - lectures: Introduction; Information networks; Protein information resources; Genome information resources; DNA sequence analysis; Pairwise sequence alignment; Multiple sequence alignment; Secondary database searching; Analysis packages; Protein structure modelling

Genome information resources
- primary DNA sequence databases
- specialised DNA sequence databases

Primary DNA sequence databases

EMBL, DDBJ, GenBank, dbEST and GSDB store DNA sequences and their annotations.

EMBL - European Molecular Biology Laboratory
- European Bioinformatics Institute (EBI)
- collaboration with DDBJ and GenBank: new entries are exchanged on a daily basis
- sources of sequences: direct author submissions, genome projects, scientific literature, patents
- growth is exponential, with a doubling time of ~9-12 months
- most entries come from model organisms
- retrieval through SRS

DDBJ - DNA Data Bank of Japan
- National Institute of Genetics
- collaboration with EMBL and GenBank
- retrieval through DBGet

GenBank
- National Center for Biotechnology Information (NCBI)
- collaboration with DDBJ and EMBL
- data split into 17 divisions
- retrieval through Entrez

Codes for the 17 divisions of GenBank

Division  Sequence subset
PRI       Primate
ROD       Rodent
MAM       Other mammalian
VRT       Other vertebrate
INV       Invertebrate
PLN       Plant, fungal, algal
BCT       Bacterial
RNA       Structural RNA
VRL       Viral
PHG       Bacteriophage
SYN       Synthetic
UNA       Unannotated
EST       EST (Expressed Sequence Tags)
PAT       Patent
STS       STS (Sequence Tagged Sites)
GSS       GSS (Genome Survey Sequences)
HTG       HTG (High Throughput Genomic Sequences)

Example GenBank entry (translation and sequence abridged)

LOCUS       DRODPPC      4001 bp    mRNA            INV       15-MAR-1990
DEFINITION  D.melanogaster [...] complex (DPP-C), complete cds
ACCESSION   M30116
NID         g157291
KEYWORDS
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 4001)
  AUTHORS   Padgett,R.W., St Johnston,R.D. and Gelbart,W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589.
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ...
                     LGYDAYYCHGKCPFPLADHFNSTNHAWQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTWLKNYQEMTWGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgccaaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggrtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
      ...
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

dbEST
- National Center for Biotechnology Information (NCBI)
- maintains only Expressed Sequence Tag (EST) data

GSDB - Genome Sequence DataBase
- National Center for Genome Resources
- complete collections of DNA sequence for genome-sequencing laboratories
- on-line submission of large-scale data
- quality checks
- format consistent with GenBank + GSDBID

Specialised DNA sequence databases

SGD, UniGene, TDB and ACeDB store species-specific and technique-specific DNA sequences.

SGD - Saccharomyces Genome Database
- molecular biology and genetics of S. cerevisiae
- complete genome, genes, proteins, phenotypes
- first eukaryotic genome sequenced (1996)
- sequence analysis, register of genes, 3D structural data, primer sequences for cloning

UniGene
- collection of genes encoding proteins (transcript map)
- non-redundant; derived from GenBank
- data organised in clusters (1 cluster = 1 unique gene)
- used in gene-mapping projects and gene expression analysis

TDB - TIGR Database
- suite of databases: DNA and protein sequences, gene expression, protein families, taxonomic data
- links: TIGR microbial genome sequencing projects, parasite databases, gene index projects, A. thaliana database, human genomic dataset

ACeDB - A Caenorhabditis elegans DataBase
- C. elegans genome project
- restriction maps, gene structural information, cosmid maps, sequence data, bibliographic information
- software to organise data

ACEDB: CGI script and Perl
[Screenshot of the ACEDB interface for C. elegans]
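A GenBank-style flat file such as the DRODPPC entry shown earlier can be split into its top-level fields with a few lines of code. The sketch below is a minimal illustration, not any database's own tooling: `parse_flatfile`, `base_count_total` and the `TOYREC` record are hypothetical names invented here. It assumes that top-level keywords occupy the first 12 columns and that indented lines are continuations, so sub-keywords such as ORGANISM are simply folded into their parent field.

```python
from collections import Counter

# Hypothetical toy record in a GenBank-style layout (not a real database entry).
# Keywords are padded to 12 columns, as assumed by the parser below.
TOY_RECORD = "\n".join([
    f"{'LOCUS':<12}TOYREC 24 bp DNA",
    f"{'DEFINITION':<12}Toy record for illustration only.",
    f"{'BASE COUNT':<12}5 a 6 c 7 g 6 t",
    "ORIGIN",
    "        1 gtcgttcaac agcgctgatc gagt",
    "//",
])


def parse_flatfile(text):
    """Return (fields, sequence) for a single GenBank-style record."""
    fields = {}
    seq_blocks = []
    key = None
    for line in text.splitlines():
        if line.startswith("//"):          # record terminator
            break
        if key == "ORIGIN" and line[:1].isspace():
            # Sequence line: a position number followed by blocks of bases.
            seq_blocks.extend(line.split()[1:])
        elif line and not line[0].isspace():
            # New top-level keyword in columns 1-12, value after it.
            key = line[:12].strip()
            fields[key] = line[12:].strip()
        elif key is not None and line.strip():
            # Indented continuation (or sub-keyword) of the current field.
            fields[key] += " " + line.strip()
    return fields, "".join(seq_blocks)


def base_count_total(base_count_value):
    """Sum the numbers in a BASE COUNT value such as '5 a 6 c 7 g 6 t'."""
    return sum(int(tok) for tok in base_count_value.split()[::2])


fields, seq = parse_flatfile(TOY_RECORD)
composition = Counter(seq)  # per-base counts of the parsed sequence
print(fields["LOCUS"], len(seq), dict(composition))
```

The same consistency check applies to the entry above: the BASE COUNT values 1170 + 1078 + 956 + 797 add up to 4001, the length given on the LOCUS line, which `base_count_total("1170 a 1078 c 956 g 797 t")` reproduces.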
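The growth figure quoted for EMBL earlier (exponential growth with a doubling time of roughly 9-12 months) can be turned into a quick calculation. The sketch below assumes idealised exponential growth, where the database grows by a factor of 2^(t/T) after t months with doubling time T; the function name `growth_factor` is hypothetical and chosen only for this illustration.

```python
def growth_factor(months, doubling_time_months):
    """Factor by which a quantity grows after `months`, assuming it
    doubles every `doubling_time_months` (idealised exponential growth)."""
    return 2 ** (months / doubling_time_months)


# Three years of growth at the two ends of the quoted 9-12 month range.
fast = growth_factor(36, 9)    # doubling every 9 months
slow = growth_factor(36, 12)   # doubling every 12 months
print(f"after 36 months: between {slow:.0f}x and {fast:.0f}x the starting size")
```

Under these assumptions a database would hold between 8 and 16 times as much data after three years, which is why the choice of doubling time matters so much for capacity planning.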