Week 2: Sequence analysis
Introduction to Bioinformatics (LF:DSIB01)

Genomics – Central Dogma of MolBio
Central Dogma of Molecular Biology
Transcription / reverse transcription (1970s)
Translation
Genomics, Transcriptomics, Proteomics
---
DNA structure
4 bases, base pairs, complementarity

Chromosomes & Genomic Loci
A genomic locus is a POSITION on a chromosome or other genomic reference. We often denote a locus by genome assembly, chromosome, start position, end position and strand. A locus IS NOT A SEQUENCE, even though a sequence might be associated with a locus.
Chromosomes: numbers, X, Y; telomere, centromere
---
Strand: 5 prime, 3 prime; upstream, downstream
---
Definition of a genomic locus

Genes
•What is a gene?
  ‒Classical Period
    •Mendel 1866: Zellelemente (cell elements): some factors that determine heredity
    •Johannsen 1909: coins the word Gene: some kind of calculating element
    •1920s: Genes linked to chromosomes and grouped by heredity
    •Muller 1926: Gene is the basis of evolution
  ‒NeoClassical Period
    •1940s: Genes have internal structure and can be dissected by recombination (1 dimension)
    •1950s: Structure of DNA
    •1960s: A gene is a discrete sequence that encodes a polypeptide (through an RNA step)
    •1960s-80s: Cistrons; one cistron = one polypeptide
  ‒Modern Period
    •1986: Alternative splicing
    •1987: Multiple transcription initiation sites (one gene = many transcripts)
    •1990s: RNA editing (the RNA sequence can be changed after transcription)
    •2000s: Gene sharing (one genomic locus can produce several widely different products)

Source: The evolving definition of the term Gene (Portin and Wilkins 2017)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5378099/pdf/1353.pdf

Genes
•The “Gene” cannot be simply defined
•There is no clear-cut hereditary unit that acts autonomously
-A Gene is a DNA sequence (not necessarily contiguous) that specifies one or more sequence-related RNAs/proteins that are involved in some Gene Regulatory Network. (This definition pushes the ‘hard’ question onto the definition of the Network.)
-A Gene is a genomic locus that produces RNAs that have been annotated as connected to each other by function or heredity. (This definition might fail to include genes split across loci, and it relies on external annotation, so it is subjective.)
-As a working definition of a Gene:
  ‒A Genomic Locus that produces related* Transcripts
  *Related implies an Annotation

Transcripts -> Genes
[Figure: a gene locus drawn 5’ to 3’ with 5’UTR, exons, introns and 3’UTR, and the RNA transcripts (exon/intron structure) produced from it]
A Gene is a locus that produces (related) Transcripts < Incomplete working definition

Genomics and Transcriptomics
•Genomics is the scientific discipline that studies heredity, genes, and genomes
•Transcriptomics is the scientific discipline that studies RNA
•Both disciplines share techniques, analyses, and practical uses
•Genomic terms will often be used as a shortcut for talking about transcripts
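The genomic locus defined earlier (assembly, chromosome, start, end, strand) can be sketched as a small data structure. This is only an illustration: the class name `GenomicLocus`, its method names, and the 1-based inclusive coordinate convention are assumptions made for this sketch and are not taken from the slides. Note that the class stores no sequence, in line with "a locus is not a sequence".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicLocus:
    """A position on a genomic reference; it deliberately holds no sequence."""

    assembly: str  # genome assembly name, e.g. "hg38" (example value)
    chrom: str     # chromosome name, e.g. "chr1", "chrX"
    start: int     # first base of the locus, 1-based (assumed convention)
    end: int       # last base of the locus, inclusive (assumed convention)
    strand: str    # "+", "-", or "." when the strand is unknown

    def __post_init__(self):
        # Reject values that cannot describe a valid locus.
        if self.strand not in {"+", "-", "."}:
            raise ValueError(f"invalid strand: {self.strand!r}")
        if not 1 <= self.start <= self.end:
            raise ValueError("expected 1 <= start <= end")

    def length(self) -> int:
        """Number of bases covered (both ends are inclusive)."""
        return self.end - self.start + 1

    def overlaps(self, other: "GenomicLocus") -> bool:
        """True if both loci share at least one base on the same reference."""
        return (
            self.assembly == other.assembly
            and self.chrom == other.chrom
            and self.start <= other.end
            and other.start <= self.end
        )

    def five_prime_end(self) -> int:
        """Coordinate of the 5' end; on the minus strand this is the larger coordinate."""
        if self.strand == ".":
            raise ValueError("5' end is undefined without a strand")
        return self.start if self.strand == "+" else self.end
```

For example, `GenomicLocus("hg38", "chr1", 200, 300, "-").five_prime_end()` returns the end coordinate, because on the minus strand "upstream" points towards larger coordinates.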
Encoding Genomic Information for Bioinformatics Use
•Location Based Formats (.bed)
•Count/Coverage Based Formats (.bedgraph .wig)
•Feature Based Formats (.gtf)
•Sequence Based Formats (.fasta .fastq)
•Multiple Alignment Files
•Alignment Based Formats (.sam)

BED (Browser Extensible Data) file format
-Tab-delimited text file
-Same number of columns on every line
-No empty fields (some can hold “.” as N/A)
BED 3 columns: chrom, chromStart, chromEnd
BED 6 columns: BED3 + name, score, strand
BED 12 columns: BED6 + thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

BED pros and cons
Pros:
•Generic
•Human readable
•Useful for simple genomic loci
Cons:
•Awkward handling of splice events
•Not useful for variable scores
•Must repeat BED3 info on every line

bedGraph
•Used to display continuous data
•Header “browser” + header “track” + sorted BED lines (chr, start, stop, score)

wiggle file (.wig)
•Comes in two flavors: variableStep vs fixedStep

Wig pros and cons
Pros:
•Compact
•Wide variety of values
Cons:
•Hard for humans to read
•Hard to add or remove lines

Gene transfer format (.gtf)
•Closely related to .gff (general feature format); GTF is a refinement of GFF version 2
•1. seqname (chr)
•2. source (program that generated the feature)
•3. feature (what is it? “CDS”, “exon”, “enhancer” etc.)
•4. start (position)
•5. end (inclusive)
•6. score (can be any float; ideally between 0 and 1000 for the UCSC browser)
•7. strand (“+”, “-”, “.”)
•8. frame (for coding exons, a number 0-2 giving the reading frame of the first base; otherwise “.”)
•9. group; list of attributes (gene_id, transcript_id etc.)

Fasta & Fastq

Fastq Quality Scores

Multiple Alignment File

Sequence Alignment Map (.sam)
•https://www.samformat.info/
•.bam format (Binary SAM)

www.ceitec.eu
CEITEC @CEITEC_Brno
Panos Alexiou
panagiotis.alexiou@ceitec.muni.cz
Office: ….
Hours: ….
Thank you for your attention!
60 minutes lunch break.
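Looking back at the BED and GTF slides, the two formats describe the same locus with different coordinate conventions, which is a common source of off-by-one errors. The sketch below converts one BED6 line into one GTF line. It relies on the UCSC convention that BED `chromStart` is 0-based and `chromEnd` is exclusive (this is not stated on the slides), while the GTF start is 1-based and the end is inclusive. The function name `bed6_to_gtf`, the default `source` and `feature` values, and the reuse of the BED name for both `gene_id` and `transcript_id` are assumptions made purely for illustration.

```python
def bed6_to_gtf(bed_line: str, source: str = "example", feature: str = "exon") -> str:
    """Convert a single tab-delimited BED6 line into a single GTF line."""
    # BED6 columns: chrom, chromStart, chromEnd, name, score, strand.
    # Unpacking raises ValueError if the line has fewer than 6 columns.
    chrom, chrom_start, chrom_end, name, score, strand = (
        bed_line.rstrip("\n").split("\t")[:6]
    )

    # BED start is 0-based, GTF start is 1-based: shift by one.
    start = int(chrom_start) + 1
    # BED end is exclusive, GTF end is inclusive: the number stays the same.
    end = int(chrom_end)

    # GTF column 9: attribute list (using the BED name for both IDs is an assumption).
    attributes = f'gene_id "{name}"; transcript_id "{name}";'

    # GTF column 8 (frame) is "." because the feature is not known to be coding.
    fields = [chrom, source, feature, str(start), str(end), score, strand, ".", attributes]
    return "\t".join(fields)
```

With this convention the feature length is `chromEnd - chromStart` in BED and `end - start + 1` in GTF, and both give the same number of bases.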