CG920 Genomics
Lesson 1: Introduction to Bioinformatics

Jan Hejátko
Functional Genomics and Proteomics of Plants, Mendel Centre for Plant Genomics and Proteomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno
hejatko@sci.muni.cz, www.ceitec.muni.cz

Outline
• Syllabus of the course
• Definition of genomics
• Role of BIOINFORMATICS in FUNCTIONAL GENOMICS
• Databases
• Spectrum of "on-line" resources
• PRIMARY, SECONDARY and STRUCTURAL databases
• GENOME resources
• Analytical tools
• Homology searching
• Searching for sequence motifs, open reading frames, restriction sites…
• Other on-line genomic tools

Course Syllabus
• Chapter 01: Introduction to bioinformatics
• Chapter 02: Identification of genes
• Chapter 03: Reverse genetics approaches
• Chapter 04: Forward genetics approaches
• Chapter 05: Functional genomics approaches
• Chapter 06: Protein-protein interactions and their analysis
• Chapter 07: Current DNA-sequencing methods
• Chapter 08: Structural genomics
• Chapter 09: Localization of genes and gene products in the cell
• Chapter 10: Genomics and systems biology
• Chapter 11: Practical aspects of functional genomics
• Chapter 12: Tools of systems biology
• Model organisms, PCR and PCR primer design

Literature
Literature sources for Chapter 01:
• Bioinformatics and Functional Genomics, 2009, Jonathan Pevsner, Wiley-Blackwell, Hoboken, New Jersey, http://www.bioinfbook.org/index.php
• Úvod do praktické bioinformatiky, Fatima Cvrčková, 2006, Academia, Praha
• Plant Functional Genomics, ed. Erich Grotewold, 2003, Humana Press, Totowa, New Jersey

Outline
• Definition of genomics

GENOMICS – What is it?
• Sensu lato (in the broad sense): the study of the STRUCTURE and FUNCTION of genomes
• Sensu stricto (in the narrow sense): the study of the FUNCTION of individual genes, i.e. FUNCTIONAL GENOMICS
• It relies mainly on reverse genetics approaches
• Prerequisite: knowledge of the genome (sequence), i.e. work with databases

Genomics is a scientific discipline concerned with the analysis of genomes. The genome of an organism is the complete set of all genes of that organism. Genes can be located in the cytoplasm (prokaryotes), in the nucleus (most eukaryotic organisms), or in mitochondria or chloroplasts (in plants). The critical prerequisite of genomics is knowledge of gene sequences. Functional genomics is concerned with the function of individual genes.

GENOMICS – What is it? The role of BIOINFORMATICS in FUNCTIONAL GENOMICS
[Figure labels: 3 : 1; forward ("classical") genetics approaches; reverse genetics approaches; insertional mutagenesis; 5'TTATATATATATATTAAAAAATAAAATAAAA GAACAAAAAAGAAAATAAAATA….3'; BIOINFORMATICS; FUNCTIONAL GENOMICS]

Once gene sequences are known (that is, once the gene sets of individual organisms, i.e. their genomes, are known), reverse genetics becomes possible and allows the function of genes to be studied. Whereas "classical" or forward genetics starts with a phenotype, reverse genetics starts with a sequence identified as a gene in the sequenced genome. Gene identification using bioinformatics approaches will be described later (see Lesson 02). Reverse genetics uses a spectrum of approaches, described in Lesson 03, that allow the isolation of sequence-specific mutants and thus the analysis of their phenotypes. The need for a phenotypic alteration in the forward genetics approach is an important difference between the two approaches.
Thus, the gene is no longer understood as a factor (trait) determining a phenotype, but rather as a piece of DNA characterized by a unique string of nucleotides, i.e. a physical DNA molecule.

Outline
• Role of BIOINFORMATICS in FUNCTIONAL GENOMICS

Bioinformatics
• Definition of bioinformatics (according to the NIH Biomedical Information Science and Technology Initiative Consortium): research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, July 17, 2000

The following working definitions of bioinformatics and computational biology were developed by the BISTIC Definition Committee and released on July 17, 2000. The committee was chaired by Dr. Michael Huerta of the National Institute of Mental Health and consisted of the following members:

Bioinformatics Definition Committee (BISTIC members and expert members): Michael Huerta (Chair), Gregory Downing, Florence Haseltine, Belinda Seto, Yuan Liu

Preamble
Bioinformatics and computational biology are rooted in life sciences as well as computer and information sciences and technologies. Both of these interdisciplinary approaches draw from specific disciplines such as mathematics, physics, computer science and engineering, biology, and behavioral science. Bioinformatics and computational biology each maintain close interactions with life sciences to realize their full potential. Bioinformatics applies principles of information sciences and technologies to make the vast, diverse, and complex life sciences data more understandable and useful. Computational biology uses mathematical and computational approaches to address theoretical and experimental questions in biology.
Although bioinformatics and computational biology are distinct, there is also significant overlap and activity at their interface.

Definition
The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

What is bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
J. Pevsner, http://www.bioinfbook.org/index.php

Bioinformatics
Bioinformatics in functional genomics:
• Processing and analysis of sequencing data
• Identification of reference sequences
• Identification of genes
• Identification of homologs, orthologs and paralogs
• Correlation analysis of genomes and phenotypes (incl. human)
• Processing and analysis of transcriptional data
• Transcriptional profiling using DNA chips or next-generation sequencing
• Evaluation of experimental data and prediction of new regulations in systems biology approaches
• Mathematical modelling of gene regulation networks

Outline
• Databases
• Spectrum of "on-line" resources

Spectrum of on-line resources
Many on-line resources are available.
• EBI, http://www.ebi.ac.uk/services
• NCBI, http://www.ncbi.nlm.nih.gov/
Nowadays, the resources are interconnected and can be accessed via dedicated web pages.

Outline
• PRIMARY, SECONDARY and STRUCTURAL databases

Primary databases
• They contain sets of primary data: DNA and protein sequences
• Sequence databases of "The Big Three":
  • EMBL, http://www.ebi.ac.uk/embl/
  • GenBank, http://www.ncbi.nih.gov/Genbank/GenbankSearch.html
  • DDBJ, http://www.ddbj.nig.ac.jp
• Daily mutual exchange and backup of data
• They handle large amounts of data (capacity and software requirements)
• September 2003: 27.2 × 10^6 entries (approx. 33 × 10^9 bp)
• August 2005: 100 × 10^9 bp from 165,000 organisms

[Figure: Growth of GenBank, 1982 to 2002; axes: year, base pairs of DNA (millions), sequences (millions). J. Pevsner, http://www.bioinfbook.org/index.php]

[Figure: Growth of GenBank + Whole Genome Shotgun (1982 to November 2008): we reached 0.2 terabases; axes: number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions), base pairs in GenBank + WGS (billions). J. Pevsner, http://www.bioinfbook.org/index.php]

[Figure: Growth of GenBank, Feb 15 2013]

WGS
[Figure from Interactive Concepts in Biochemistry, Rodney Boyer, Wiley, 2002, http://www.wiley.com//college/boyer/0470003790/]

Shotgun sequencing allows a scientist to rapidly determine the sequence of very long stretches of DNA. The key to this process is fragmenting of the genome into smaller pieces that are then sequenced side by side, rather than trying to read the entire genome in order from beginning to end. The genomic DNA is usually first divided into its individual chromosomes. Each chromosome is then randomly broken into small strands of hundreds to several thousand base pairs, usually accomplished by mechanical shearing of the purified genetic material. Each of the short DNA pieces is then inserted into a DNA vector (a viral genome), resulting in a viral particle containing "cloned" genomic DNA (Fig. 1). The collection of all the viral particles with all the different genomic DNA pieces is referred to as a library. Just as a library consists of a set of books that together make up all of human knowledge, a genomic library consists of a set of DNA pieces that together make up the entire genome sequence.

Placing the genomic DNA within the viral genome allows bacteria infected with the virus to faithfully replicate the genomic DNA pieces. Additionally, since a little bit of known sequence is needed to start the sequencing reaction, the reaction can be primed off the known flanking viral DNA. In order to read all the nucleotides of one organism, millions of individual clones are sequenced. The data is sorted by computer, which compares the sequences of all the small DNA pieces at once (in a "shotgun" approach) and places them in order by virtue of their overlapping sequences to generate the full-length sequence of the genome (Fig. 2).
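As a toy illustration of how a computer can place fragments in order "by virtue of their overlapping sequences", the following Python sketch greedily merges the two fragments with the longest suffix/prefix overlap until no overlaps remain. It is a minimal sketch under strong assumptions, not how production assemblers work: the reads are error-free, all come from the same strand, and there are no repeats. The function names (`overlap`, `greedy_assemble`), the `min_len` parameter and the example reads are hypothetical and were chosen only for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge reads pairwise, always taking the longest overlap first.

    Toy model only: assumes error-free reads from one strand, no repeats.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three made-up overlapping fragments of a made-up 21-bp "genome".
    reads = ["TTAGCCTAGGA", "ATGGCGTACG", "GTACGTTAGC"]
    print(greedy_assemble(reads))
```

Fragments that share no sufficiently long overlap are returned as separate contigs, which loosely mirrors the gap-free contigs that WGS projects submit.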
To statistically ensure that the whole genome sequence is acquired by this method, an amount of DNA equal to five to ten times the length of the genome must be sequenced. (Interactive Concepts in Biochemistry, Rodney Boyer, Wiley, 2002, http://www.wiley.com//college/boyer/0470003790/)

Arrival of next-generation sequencing: in two years we have gone from 0.2 terabases to 71 terabases (71,000 gigabases) (November 2010)
J. Pevsner, http://www.bioinfbook.org/index.php

DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes that are being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required. The pieces of a WGS project are the contigs (overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled together into scaffolds (contig sequences separated by gaps) and/or chromosomes. We must have the contig sequences without gaps as the basic units for all WGS projects.

Primary databases
• They contain sets of primary data: DNA and protein sequences
• Protein sequences:
  • PIR, http://pir.georgetown.edu/
  • MIPS, http://www.mips.biochem.mpg.de
  • SWISS-PROT, http://www.expasy.org/sprot/

Primary databases
Types of sequences in primary databases:
• Standard nucleotide sequences acquired by high-quality sequencing
• ESTs (Expressed Sequence Tags)
• HTGS (High Throughput Genomic Sequences): results of sequencing projects without annotation
• Reference sequences of annotated genomes
• TPAs (Third Party Annotation): sequences annotated by a third party (someone other than the original authors)

Primary databases
GenBank (NCBI), http://www.ncbi.nlm.nih.gov/
[Screenshots of GenBank records; one label survives: Accession number]

What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

Molecule  Accession   Description
DNA       X02775      GenBank genomic DNA sequence
DNA       NT_030059   Genomic contig
DNA       rs7079946   dbSNP (single nucleotide polymorphism)
RNA       N91759.1    An expressed sequence tag (1 of 170)
RNA       NM_006744   RefSeq DNA sequence (from a transcript)
Protein   NP_007635   RefSeq protein
Protein   AAC02945    GenBank protein
Protein   Q28369      SwissProt protein
Protein   1KT7        Protein Data Bank structure record

Page 27, J. Pevsner, http://www.bioinfbook.org/index.php

NCBI's important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon "reference" version of a sequence.

RefSeq identifiers include the following formats:

Complete genome      NC_######
Complete chromosome  NC_######
Genomic contig       NT_######
mRNA (DNA format)    NM_######   e.g. NM_006744
Protein              NP_######   e.g. NP_006735

Page 27, J. Pevsner, http://www.bioinfbook.org/index.php

RefSeq
NCBI's RefSeq project: many accession number formats for genomic, mRNA and protein sequences

Accession         Molecule  Method            Note
AC_123456         Genomic   Mixed             Alternate complete genomic
AP_123456         Protein   Mixed             Protein products; alternate
NC_123456         Genomic   Mixed             Complete genomic molecules
NG_123456         Genomic   Mixed             Incomplete genomic regions
NM_123456         mRNA      Mixed             Transcript products; mRNA
NM_123456789      mRNA      Mixed             Transcript products; 9-digit
NP_123456         Protein   Mixed             Protein products
NP_123456789      Protein   Curation          Protein products; 9-digit
NR_123456         RNA       Mixed             Non-coding transcripts
NT_123456         Genomic   Automated         Genomic assemblies
NW_123456         Genomic   Automated         Genomic assemblies
NZ_ABCD12345678   Genomic   Automated         Whole genome shotgun data
XM_123456         mRNA      Automated         Transcript products
XP_123456         Protein   Automated         Protein products
XR_123456         RNA       Automated         Transcript products
YP_123456         Protein   Auto. & Curated   Protein products
ZP_12345678       Protein   Automated         Protein products

J. Pevsner, http://www.bioinfbook.org/index.php

Secondary databases
• Databases of functional or structural motifs, obtained by comparing primary data (sequences)
• PROSITE, http://www.expasy.org/prosite/
• PRINTS, http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
• TRANSFAC, http://www.gene-regulation.com/
• Scaffold/Matrix Attached Region transaction Database

S/MARt DB (scaffold/matrix attached region transaction database). This database collects information about S/MARs and the nuclear matrix proteins that are supposed to be involved in the interaction of these elements with the nuclear matrix.
(http://transfac.gbf.de/SMARtDB/index.html)

Structural databases
• PDB, http://www.rcsb.org/pdb/
[Figure: Pekárová et al., Plant Journal (2011)]

Outline
• GENOME resources

Genome resources
• Human Genome Browser, http://genome.ucsc.edu/cgi-bin/hgGateway
• TAIR, The Arabidopsis Information Resource, http://www.arabidopsis.org

Outline
• Analytical tools
• Homology searching

Analytical tools
Global versus local alignment:
• Global alignment: only for sequences that are similar and of similar length (BUT gaps can be inserted into one or both sequences)
• Local alignment identifies and compares regions of high similarity within sequences, so it works even when, for example, the order of protein domains has changed during evolution
• Global alignment is used mainly for multiple alignment (CLUSTALW, later in the presentation)
Cvrčková, Úvod do praktické bioinformatiky

Analytical tools
Choosing the right type of alignment using a dotplot:
• Plot the two sequences against each other (x and y axes)
• Mark identities as a "dot" of a specific size (e.g. 2 bp)
• Filter out diagonals shorter than a threshold
Cvrčková, Úvod do praktické bioinformatiky

Analytical tools
Examples of sequence alignment using a dotplot:
• Global alignment is possible only for sequences A and B
• The remaining sequences underwent a change in the order of protein domains, so a local alignment is necessary
• A dotplot can be obtained using BLAST2 (see later in the presentation)
Cvrčková, Úvod do praktické bioinformatiky

Analytical tools
• BLAST, http://ncbi.nlm.nih.gov/BLAST/

BLAST: Basic Local Alignment Search Tool
• Word size: 10-11 bp or 2-3 aa
• Homology is scored with PAM (Point Accepted Mutation) or BLOSUM (BLOcks Substitution Matrix) matrices
• Primary similarities (seed matches)
• Homology regions are extended to the left and to the right
• The results are displayed

[Figure: MRKEV → (deletion) → MRKE → (substitution) → MRKY → (insertion) → MRAKY

M R . K E V
| |   | :
M R A K Y

PAM 250 matrix. Cvrčková, Úvod do praktické bioinformatiky]

BLAST: Basic Local Alignment Search Tool
E = expectation value
• The expectation value gives the expected number of sequences with the same or better similarity that would be found when searching an equally large database composed of random sequences
• The result reports the fraction of identical positions (for proteins also of similar positions) and, where applicable, the number of inserted gaps

Primary databases
BLINK is a link to the pre-computed BLAST search results for the respective sequence (see the next slide).

BLAST: Specialized versions
Many specialized versions of BLAST currently exist:
• Searching by source (organism) of the sequences, e.g. known genomes of microorganisms
• BLASTP: given a protein query, returns the most similar protein sequences from the protein database
• BLASTN: given a DNA query, returns the most similar DNA sequences from the DNA database
• BLASTX: compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database
• TBLASTN: compares a protein query against all six reading frames of a nucleotide sequence database
• TBLASTX: translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database
• Other variants, e.g. MEGABLAST, for identification of identical or very similar sequences (searches for long similar regions of nucleotide sequences)

BLAST: Specialized versions
PSI-BLAST (Position-Specific Iterated BLAST):
• First step: a standard BLAST search, in which PSI-BLAST identifies a list of similar sequences with an E value better than a threshold (default = 0.005)
• For every alignment, PSI-BLAST creates a so-called PSSM (position-specific scoring matrix)
• The PSSM takes into account the relative frequency of a specific amino acid residue at a specific position within the sequences identified as similar in the first step, which can indicate functional conservation

BLAST: Specialized versions
PHI-BLAST (Pattern-Hit Initiated BLAST):
• For identification of a specific sequence, e.g. a motif (pattern), in a set of similar protein sequences
• The motif sequence must be entered using a special syntax:
  • [LVIMF] means either Leu, Val, Ile, Met or Phe
  • - is a spacer (means nothing)
  • x(5) means 5 positions in which any residue is allowed
  • x(3, 5) means 3 to 5 positions where any residue is allowed

BLAST: Specialized versions
• Example of a search by PHI-BLAST

Outline
• Searching for sequence motifs, open reading frames, restriction sites…

Analytical tools
• http://workbench.sdsc.edu/
• VPCR, http://grup.cribi.unipd.it/cgi-bin/mateo/vpcr2.cgi

Outline
• Other on-line genome tools

Other online genome resources
• TIGR (The Institute for Genomic Research, http://www.tigr.org/software/)
• Now part of the J. Craig Venter Institute
• Online Mendelian Inheritance in Man (OMIM)
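
Returning to the PHI-BLAST pattern syntax introduced earlier in this lesson: the four elements listed there ([LVIMF], -, x(5) and x(3, 5)) map directly onto regular expressions, which makes it easy to check a pattern against a sequence locally before submitting a search. The Python sketch below is only an illustration under stated assumptions. The function name `pattern_to_regex`, the example pattern and the example protein fragment are all made up for this purpose, and only the elements named in this lesson (plus single residue letters and a bare `x`) are handled; the full pattern syntax accepted by PHI-BLAST is not covered.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Translate the pattern elements listed in this lesson into a Python regex.

    Handled: residue letters (e.g. G), residue sets (e.g. [LVIMF]),
    x, x(n), x(n,m) and the "-" spacer. Anything else raises ValueError.
    """
    parts = []
    for element in pattern.split("-"):  # "-" is only a spacer
        element = element.strip()
        if not element:
            continue
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:\s*,\s*(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")                      # x      -> any one residue
            elif high is None:
                parts.append(".{%s}" % low)            # x(5)   -> exactly 5 residues
            else:
                parts.append(".{%s,%s}" % (low, high))  # x(3,5) -> 3 to 5 residues
        elif re.fullmatch(r"[A-Z]|\[[A-Z]+\]", element):
            parts.append(element)  # a letter or [..] set is already valid regex
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return "".join(parts)


if __name__ == "__main__":
    regex = pattern_to_regex("[LVIMF]-x(5)-G-x(3, 5)-C")  # made-up pattern
    print(regex)
    hit = re.search(regex, "AALQWERTGPPPCKK")             # made-up sequence
    print(hit.group(), hit.span())
```

A local check like this only tells whether the pattern occurs in a sequence; the similarity search seeded at the pattern hit is still carried out by PHI-BLAST itself.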