Week 3: Filetypes and Browser
Introduction to Bioinformatics (LF:DSIB01)

Encoding Genomic Information for Bioinformatics Use
•Location Based Formats (.bed)
•Count/Coverage Based Formats (.bedgraph .wig)
•Feature Based Formats (.gtf)
•Sequence Based Formats (.fasta .fastq)
•Multiple Alignment Files
•Alignment Based Formats (.sam)

BED (Browser Extensible Data) file format
-Tab-delimited text file
-Number of columns is consistent on every line
-No empty fields (some can hold “.” as N/A)
BED 3 columns: chrom, chromStart, chromEnd
BED 6 columns: BED3 + name, score, strand
BED 12 columns: BED6 + thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
https://genome.ucsc.edu/FAQ/FAQformat.html#format1

BED pros and cons
•Generic
•Human readable
•Useful for simple genomic loci
•Awkward handling of splice events
•Poorly suited to scores that vary across a region
•Must repeat BED3 info on every line

bedGraph
•Used to display continuous data
•Layout: “browser” header + “track” header + sorted BED-style lines (chrom, start, end, score)

Wiggle file (.wig)
•Comes in two flavors: variableStep and fixedStep

Wig pros and cons
•Compact
•Wide variety of values
•Hard for humans to read
•Hard to add or remove lines

Gene transfer format (.gtf)
•Closely related to, and often used interchangeably with, .gff (General Feature Format)
•1. seqname (chr)
•2. source (program that generated the feature)
•3. feature (what is it? “CDS”, “exon”, “enhancer” etc.)
•4. start (position)
•5. end (inclusive)
•6. score (can be any float; for the UCSC browser, ideally between 0 and 1000)
•7. strand (“+”, “-”, “.”)
•8. frame (for coding exons, 0-2, giving the reading frame of the first base; otherwise “.”)
•9. group: list of attributes (gene_id, transcript_id etc.)

Fasta & Fastq

Fastq Quality Scores

Multiple Alignment File

Sequence Alignment Map (.sam)
•https://www.samformat.info/
•.bam format (Binary SAM)
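
Sketch: reading BED and GTF lines in Python
The BED and GTF column lists above map directly onto a split on tabs. The snippet below is a minimal sketch of that idea, not a full parser. The function names (parse_bed_line, parse_gtf_line) and the example records are made up for illustration, and the attribute parsing assumes the common key "value"; layout with no semicolons inside values.

```python
from typing import Any, Dict

# Column names in the order given on the BED and GTF slides.
BED_COLUMNS = [
    "chrom", "chromStart", "chromEnd",            # BED3
    "name", "score", "strand",                    # BED6 adds these
    "thickStart", "thickEnd", "itemRgb",
    "blockCount", "blockSizes", "blockStarts",    # BED12 adds these
]
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]


def parse_bed_line(line: str) -> Dict[str, str]:
    """Split one tab-delimited BED line and label its columns."""
    fields = line.rstrip("\n").split("\t")
    # The first three columns are required; up to nine more may follow.
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3 to 12 columns, got {len(fields)}")
    if any(field == "" for field in fields):
        raise ValueError("BED fields must not be empty; use '.' for N/A")
    return dict(zip(BED_COLUMNS, fields))


def parse_gtf_line(line: str) -> Dict[str, Any]:
    """Split one tab-delimited GTF line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record: Dict[str, Any] = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(fields[3])
    record["end"] = int(fields[4])  # end position is inclusive

    # Assumed attribute layout: gene_id "G1"; transcript_id "T1";
    attributes: Dict[str, str] = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical example records, not taken from any real annotation.
example_bed = "chr1\t100\t200\tpeak1\t500\t+"
example_gtf = 'chr1\texample\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'

bed_record = parse_bed_line(example_bed)
gtf_record = parse_gtf_line(example_gtf)
print(bed_record)
print(gtf_record)
```

Keeping the column names in one list per format means a BED3, BED6 or BED12 line goes through the same function; zip simply stops at the last column present.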