Outline
Sequencing (NGS) in general
Sequencing data analysis in general
Kathi Zarnack and Julian König data
Results (selected)
Galaxy RNA-Seq data analysis

Sequencing (NGS) in general
Griffith M, Walker JR, Spies NC, Ainscough BJ, Griffith OL. Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud. PLoS Comput Biol. 2015.
How do we get the data: the process of sequencing, isolation of RNA, the iCLIP/RNA-Seq principle (very brief), and all the way to .fastq

Sequencing data analysis in general
(Griffith et al., PLoS Comput Biol. 2015)
Now we have the FASTQ files, so what do we do with them? This is just an example, not the workflow we will use. We need raw reads, an annotation and a reference genome. Then we can align the reads and/or assemble the transcriptome, quantify or find peaks, and further process the data.

Kathi Zarnack data
Zarnack K, König J, Tajnik M, Martincorena I, Eustermann S, Stévant I, Reyes A, Anders S, Luscombe NM, Ule J. Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of Alu elements. Cell. 2013.
Summary of the results
There are ~650,000 Alu elements in transcribed regions of the human genome. These elements contain cryptic splice sites, so they are in constant danger of aberrant incorporation into mature transcripts. Despite posing a major threat to transcriptome integrity, little is known about the molecular mechanisms preventing their inclusion. Here, we present a mechanism for protecting the human transcriptome from the aberrant exonization of transposable elements. Quantitative iCLIP data show that the RNA-binding protein hnRNP C competes with the splicing factor U2AF65 at many genuine and cryptic splice sites. Loss of hnRNP C leads to formation of previously suppressed Alu exons, which severely disrupt transcript function.
Minigene experiments explain disease-associated mutations in Alu elements that hamper hnRNP C binding. Thus, by preventing U2AF65 binding to Alu elements, hnRNP C plays a critical role as a genome-wide sentinel protecting the transcriptome. The findings have important implications for human evolution and disease.
Summary of the paper (Zarnack et al., Cell. 2013)
Highlight: the protection from cryptic exons and the competition between hnRNP C and U2AF65

Results (selected)
RNA-Seq coverage over exons
In control, cryptic exons are not expressed
In hnRNP C knockdown, cryptic exons are expressed
(Zarnack et al., Cell. 2013)
Highlight the coverage of Alu exons. Shown here at the PTS gene, but we will show CD55.

Goal of the practical
Get from the raw sequencing data to the gene expression (RNA-Seq)
Analyze RNA-Seq data and get differential gene expression and expression of individual exons (example at the CD55 gene)
Show coverage of cryptic exon(s) (example at the CD55 gene)
Do everything in less than half a day

Galaxy practical

Get the data
Or just load the preloaded data:
Shared Data -> Data Libraries -> Bi5444 -> RNA-Seq

Initial quality check
Check the raw reads quality
Using FastQC tool
Input FASTQ, output HTML

Initial quality check
It is still running, right? But without that, we cannot proceed 😞
We have a solution! :) Import Galaxy history
Most likely the alignment will take quite some time. Guide them to the sharing of the history in Galaxy.
Initial quality check - Import Galaxy history
Type CEITEC in the search box.

Initial quality check - HTML report(s)
We will show you the analysis on one file only, but you can easily do it for the rest of them at home.
But there are too many of them
MultiQC makes your life simpler
This time, you can try it on your own!

Summary of the logs
Summarize the output logs
Using MultiQC tool
Input LOG(s), output HTML

Initial quality check - Adapter content
If you scroll all the way down, you can see some residual adapter content
But if you hover over the lines, you will see that the percentages are very low (you can also see it on the y-axis scale)

Read preprocessing
Remove adapters and/or trim low-quality ends
Using Trimmomatic trimmer
Input FASTQ, output FASTQ
Adapter sequence (partial):
>adapter
AGATCGGAAGAGC

Preprocessing quality check
Check the preprocessed reads quality and summarize
Using FastQC & MultiQC tools
Input FASTQ/LOG, output HTML
Please run FastQC and MultiQC on the preprocessed files and check the adapter content
Shared Data -> Histories -> Bi5444_RNA-Seq_preprocess

Preprocessing quality check
Check the preprocessed reads quality & summarize
Are all the bad things gone?
Yep, we are nice and clean!
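The adapter removal described in the Read preprocessing slides can be illustrated with a minimal Python sketch. This is not how Trimmomatic works internally (it uses its own matching and quality-trimming logic); the sketch only shows the idea of cutting a read at an exact occurrence of the partial adapter from the slide, or at a partial adapter overlap at the 3' end. The function name `trim_adapter` and the `min_overlap` parameter are hypothetical, chosen for illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # partial adapter sequence from the slide


def trim_adapter(read, adapter=ADAPTER, min_overlap=3):
    """Return the read with the adapter (and everything after it) removed.

    Exact matching only; real trimmers also tolerate mismatches.
    min_overlap is a hypothetical parameter: the shortest adapter prefix
    at the 3' end of the read that is still treated as adapter.
    """
    # Case 1: the full adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with a prefix of the adapter (longest first).
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


# Made-up reads for illustration only.
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTTT"))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC"))              # ACGTACGT
```

Short overlaps can match by chance, which is why a minimum overlap is assumed here; the value 3 is arbitrary.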
Preprocessing quality check
Are all the bad things gone?
Actually, for modern aligners such as STAR, it doesn’t matter that much
They can perform soft-clipping

Soft-clipping in alignment
Hiding of non-matching parts of the reads
Can overcome remaining adapter or low-quality sequences
Only to specified limits (in STAR the default is max. 33% of the read length)
(Figure label: Soft-clipped part)
But allowing too much soft-clipping results in ambiguous mapping and possible cross-mapping events
For example, the default minimal aligned length after soft-clipping in BWA-MEM (a popular aligner for DNA reads) is only 19 nt!
Such a short alignment is very unspecific, and you can easily map bacterial sequences to the human reference genome

Alignment to genome
Align RNA-Seq data to a genome
Using STAR aligner
Input FASTQ, output BAM
The name of the input file might be different.

Alignment to genome
Sit back, wait and relax
The USA is most likely waking up right now, so the servers are very busy.
Shared Data -> Histories -> Bi5444_RNA-Seq_alignment

Alignment to genome
STAR performs well even with defaults
Main output is the BAM file
This is one of the few files worth keeping and saving
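Soft-clipping is recorded in the CIGAR string of each alignment in the SAM/BAM output as `S` operations. The following Python sketch shows how the soft-clipped fraction of a read could be computed from a CIGAR string. The function names and the `max_fraction` parameter are hypothetical; the 0.33 default only mirrors the figure quoted on the slide and is not read from any STAR setting.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_fraction(cigar):
    """Fraction of the read's bases that are soft-clipped (CIGAR 'S')."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume read bases according to the SAM specification.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    clipped = sum(n for n, op in ops if op == "S")
    return clipped / read_len if read_len else 0.0


def too_much_clipping(cigar, max_fraction=0.33):
    """True if more than max_fraction of the read is soft-clipped."""
    return soft_clipped_fraction(cigar) > max_fraction


# Made-up CIGAR strings for illustration only.
print(too_much_clipping("10S90M"))       # False
print(too_much_clipping("40S60M"))       # True
print(too_much_clipping("50M1000N50M"))  # False (spliced read, no clipping)
```

Note that the `N` operation (a spliced gap, typical for RNA-Seq reads) does not consume read bases, so it does not affect the fraction.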
Quality control of alignment
Run MultiQC to assess the alignment

Rename and tags
Better comprehensibility of the names

Gene counts
For the raw gene counts (expression), you need a list of genes and their positions in the genome: the gene annotation
Using UCSC Main table browser
Input nothing, output GTF
Shared Data -> Data Libraries -> Bi5444 -> RNA-Seq -> Ensemble_Homo_sapiens.GRCh38.94.gtf.gz

Gene counts
Get the raw gene counts
Using featureCounts tool
Input BAM and annotation GTF, output TXT (raw gene counts)

Gene counts - Quality control of gene counts
Again MultiQC

Differential gene expression
Get differential gene expression from the raw counts
Using edgeR and DESeq2 tools

Differential gene expression - note
Optimally, the experiment should be designed with at least three biological replicates
However, if the data are only “supportive”, two replicates are enough

Differential gene expression
Shared Data -> Histories -> Bi5444_RNA-Seq_DE_start
We can see that the samples are clustering nicely together

Gene symbol annotation
But we do not see any gene symbols/names, which we all like
Merge with HUGO information (https://www.genenames.org/)
Shared Data -> Data Libraries -> Bi5444 -> RNA-Seq -> HUGO Gene information

Alignment coverage
Visualization of the coverage of aligned data (and expressed exons)
Using deepTools -> bamCoverage
Input BAM, output BIGWIG
Effective genome size (hg38): 2913022398
https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html
Note: The versions of the tools have changed since 2017, so it might look slightly different.

Alignment coverage - BIGWIG coverage
Alignment coverage - BIGWIG visualization
Alignment coverage - CD55 region
Alignment coverage - BIGWIG size

Alignment coverage
Now do the same for the other two BIGWIG files
There is an antisense Alu present at “exon10”
You can actually see there is one more antisense Alu, and it is also slightly expressed after hnRNP C knockdown
And if you look further, the exon is annotated in RefSeq as well (both of them)

Alignment coverage
If something went wrong, use the history of DE and coverage visualization:
Shared Data -> Histories -> Bi5444_RNA-Seq_DE_full_history

RNA-Seq data analysis - pipeline in Galaxy
1. Initial quality check - FastQC
○ Check for overall quality of the data, number of reads, read length distribution, ...
2. Preprocessing - Trimmomatic
○ Remove adapters, low quality ends, unwanted sequences, …
3. Alignment - STAR
○ Map reads to the reference genome
4. Alignment quality check - STAR log, featureCounts
○ Check overall alignment statistics
5. Genome coverage (peaks) - bamCoverage
○ Get overview of mapped positions in the genome
6. Gene annotation - UCSC Main table browser
○ Get gene annotations for reference genome
7. Quantification - featureCounts
○ Get gene read counts
8. Differential gene expression - edgeR, DESeq2 (genes)
○ Differences between conditions

RNA-Seq data analysis - other possibilities
1. Initial quality check - FastQC
○ Check for overall quality of the data, number of reads, read length distribution, ...
2. Preprocessing - Cutadapt, BBTools, seqtk
○ Remove adapters, low quality ends, unwanted sequences, …
3. Alignment - GSNAP, Bowtie2
○ Map reads to the reference genome
4. Alignment quality check - Picard tools, RSeQC, Qualimap
○ Check overall alignment statistics
5. Genome coverage (peaks) - STAR + bedGraphToBigWig, Bedtools
○ Get overview of mapped positions in the genome
6. Gene annotation - UCSC, Ensembl, NCBI
○ Get gene annotations for reference genome
7. Quantification - RSEM, HTSeq, Salmon, Kallisto
○ Get gene read counts
8. Differential gene expression - DEXSeq (exons), baySeq (genes)
○ Differences between conditions
9. Gene ontology and pathways - g:Profiler, KEGG
○ Check ontologies and pathways for selected genes
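As a rough illustration of what happens between quantification (step 7) and differential expression (step 8), the Python sketch below scales raw gene counts to counts per million (CPM) and computes a naive log2 fold change between two samples. This is deliberately simplified and is not what edgeR or DESeq2 do; those tools model the counts statistically and use replicates to estimate variability. The function names, the pseudocount, the gene names and the counts are all made up for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw gene counts of one sample to counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def log2_fold_change(control, knockdown, pseudocount=1.0):
    """Naive per-gene log2(knockdown / control) on CPM values.

    Assumes both samples contain the same genes. The pseudocount
    (a hypothetical choice) avoids division by zero for unexpressed genes.
    """
    cpm_c = counts_per_million(control)
    cpm_k = counts_per_million(knockdown)
    return {
        gene: math.log2((cpm_k[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in cpm_c
    }


# Made-up counts for illustration only; not data from the practical.
control = {"geneA": 500, "geneB": 300, "geneC": 200}
knockdown = {"geneA": 500, "geneB": 900, "geneC": 600}

lfc = log2_fold_change(control, knockdown)
for gene, value in lfc.items():
    print(gene, round(value, 2))
```

In this toy example geneA has the same raw count in both samples but a negative fold change after scaling, because the knockdown sample has twice as many reads in total. This is why raw counts should not be compared directly across samples.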