Lecture 5 : RNA-seq analysis Modern Genomic Technologies (LF:DSMGT01 ) Vojta Bystry vojtech.bystry@ceitec.muni.cz NGS data analysis 2 2 Raw data .fastq Genome/Transcriptome Reference Mapping .bam Interaction analysis CHIP-seq Expression analysis RNAseq Variant analysis WES de-multiplexing Not known reference QC QC > Experiment design Not ”classic” reference Metagenomics Reference assembly Immunogenetic VDJ-genes CRISPR sgRNA Methylation Bisulfide-seq … The RNA-Seq workflow Alignment ●Mapping to genome or transcriptome? ●Genome ○Requires spliced alignment ○Can find novel genes/isoforms/exons ○Information about whole genome/transcriptome ●Transcriptome ○No spliced alignments necessary ○Many reads will map to multiple transcripts (shared exons) ○Cannot find anything new ○Difficult to determine origin of reads (multiple copies of transcripts) Alignment ●Our choice is the STAR aligner ●It performs genome alignment ●Offers a lot of settings to support splicing, soft-clipping, chimeric alignments, ... ●Other techniques (Salmon or Kallisto) do not use alignment per se and can give you the gene count information right away ○They use only transcriptome as a reference and are very quick ○Drawback is you see only what’s in the transcriptome and nothing else Duplication removal - UMI 6 •PCR duplicates •Optical duplicates • •How the tools recognize duplicates ‒Maps to the exact same place •Problem is it could be identical fragment not PCR duplicate •UMI helps ‒Maps to the exact same place ‒AND have identical UMI sequence • Post-alignment QC ●Post-alignment QC gives us information about the mapping ○Number of mapped reads - unique + multi mapped ○Mapped locations ○Duplication rates ○Library strand specificity ○Captured biotypes ○rRNA contamination ○5’ to 3’ end coverage bias ○… Post-alignment QC - Tools ●STAR alignment results - number of mappings and others ●RSeQC - mapped locations (Read Distribution), library strand specificity ●featureCounts biotypes - summary of mappings to gene biotypes ●FastQ screen (not exactly Post QC) - residual content of rRNA, tRNA, general mapping percentage to the genome (if selected) ●Qualimap - general alignment statistics focused on RNA-Seq (rnaseq) including gene body coverage Post-alignment QC - RSeQC ●RSeQC is a general tool for many QC results ●Few of them are ○Read distribution - calculates assignment of reads to different genomic features ○Infer experiment - test strand specificity of the library ○Inner distance - calculates approximate distance between read pairs Post-alignment QC - FastQ_screen ●FastQ screen is a quick scan of potential mapping locations on different references ●We can use it to do a quick scan of contaminations (various organisms) as well as estimate residual rRNA content ○In polyA selection based libraries we expect to have less then 2% rRNA content ○In rRNA libraries we can have up to 10-15% of rRNA and still consider it a good library ●Biobloom other option more computationally expensive Post-alignment QC - Qualimap ●Qualimap performs a numerous checks of the alignment ○One of the modules is rnaseq which is focused directly on RNA-Seq alignments ●One of the main information we can get from this module is the gene body coverage ○We would like to see a nice and even read mapping coverage along the whole length of the genes ○The coverage, however, depends on the library fragmentation (low RIN, FFPE samples but also depends on the used library kit (Lexogen QuantSeq)! Note: Gene body coverage ●Often, libraries with high fragmentation (and low RIN numbers) combined with polyA selection might have strong 3’ end bias ○This is a result of polyA “pulled” fragments ●Some kits, however, target only the polyA tail or sequences close to it ○An example is Lexogen QuantSeq which sequences only one read per mRNA molecule close to polyA tail Note: Gene body coverage II Examples Source: Sigurgeirsson et al. PLoS ONE 2014 Mapping QC 14 15 ●Examples Mapping QC Feature counting ●Now, when we know our alignments are solid we need to get the number of reads mapped to a gene (or other feature) ○From there, we can calculate the differential expression ●The question is, how do we summarize the counts ○Do we want only uniquely mapped reads ○Do we want also multi mapped? And how do we assign them? All? One random? Somehow else? ○And what if we have multiple genes which overlap each other? Strand specific library ●We can basically have three strand specificities ○Non stranded/Unstranded - not very common anymore ■Direction of the read mapping is completely random (50/50) ○Forward (sense) stranded - common for target kits and “bacterial kits” ■Direction of the read mapping is the same as the gene it originates from ○Reverse (antisense) stranded - “default” for Illumina and NEB kits ■Direction of the read mapping is the opposite as the gene it originates from ●In case of paired-end sequencing it’s measure by the first (R1) read orientation (FR, RF) Read to gene assignment Figure 2 | Key criteria for evaluation of strand-specific RNA-seq libraries. (a–d) Categories of quality assessment were complexity (a), strand specificity (b), evenness of coverage (c) and comparison to known transcript structure (d). Double-stranded genome with gene ORF orientation (blue arrows) and UTRs (blue lines) are shown along with mapped reads (black and red arrows, reads mapped to sense and antisense strands, respectively). https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rb-rnaseq/tutori al.html Feature counting ●The regular settings are - summarize reads mapping to exons (-t exon) and sum them up to gene id (-g gene_id) ● ●Other possibilities: ○Count per exons ○Include introns ○… Gene counts - Tools ●featureCounts is build around the “classic” read to gene assignment ○By default, assigns only uniquely mapped reads an only reads uniquely assignable to a single gene (but both can be changed) ○Gives you raw read counts per gene ●RSEM is efficient in counting also multi mapped reads and can estimate expression of individual gene isoforms ○Tries to “weight” the probability a mapped position of a multi mapped read and assign it correctly to the real source ○Gives you estimated counts per gene as well as per isoform and normalized TPM = Transcripts per million transcripts ● ●But, there is a big differences in the minimal required “good” aligned reads Minimal number of reads and expression I ●RSEM is less precise in low read counts (<40-50M reads) and for low expressed RNAs (difficult to estimate) ●For lower read counts it’s safer to go for featureCounts ●Our best practices for a minimal read count for each tools: ○Less than 40-50M aligned reads (to the good stuff) -> featureCounts ○More than 40-50M aligned reads (to the good stuff) -> RSEM ●But if you want isoforms!!! -> RSEM Source: Roberts et al. Nature Methods 2013 RSEM, Salmon and Kallisto give slightly better estimates of the “real” expression because they don’t discard any mapped read. Because even if the read is multi mapped it’s still real. 22 Feature count results Differential expression ●We have our raw read counts but we need to find the real differences ●We want to figure out the change comparing the before and after treatment ●What are the changed genes? Are there even any? Is there even difference between the samples? And what about the experimental design - paired samples - does it affect the evaluation? ●The tools for the differential expression have to account for different libraries depths, model and “fix” outliers, account for different levels of expressions, and many other things ●Luckily, there are few tools that have all of this and can be used Differential expression - tools ●DESeq2 ○More specific ●edgeR ○More sensitive ●The important part of the calculation is the design ○Assignment of a group/condition to a sample ○If the samples are paired (the same patient twice) we have to account for this as well! ○Technically, the pairing of the samples is a batch effect so it is similar to have a technical noise in your data Pairing of the samples/batch effect ●Paired samples are not the same as paired-end sequencing! ●There is a bad experimental design and a good experimental design ●Very simply - more randomization gives you better results Pairing of the samples/batch effect Supplemental Figure 1: The problem of confounding biological variation and batch effects. The top section depicts a completely confounded study design of processing individual cells from three biological groups (represented by shapes) in three separate batches (represented by colors). In this case, we cannot determine if biology or batch effects drive the observed variation. The bottom section depicts a balanced study design consisting of multiple replicates (rep) split and processed across multiple batches. The use of multiple replicates allows observed variation be attributed to biology (cells cluster by shape) or batch effects (cells cluster by color). ●And example pairing of the patients AND different sequencing years - double batch Pairing of the samples/batch effect 1st figure - no batch effect/pairing of the samples included - we can see that individual sequencing years are separated (left and rigth) and also that individual patient samples are clustered together (even and odd numbers). We don’t see clustering of CXCR4bright and CXCRdim samples together:( 2nd figure - batch effect/pairing of the samples was included in the calculation and voila - samples are nicely clustered according to the condition and not according to the patient pairing and/or year of the sequencing. Without accounting for the batch effect/pairing of the samples we wouldn’t get any reasonable results and what even more - we might get completely wrong results! 28 Differential expression results 29 Count normalisation ●Normalize to: ○Gene size ○Library size ●rpkm - Reads Per Kilobase of transcript per Million mapped reads ●fpkm - Fragments Per Kilobase of transcript per Million mapped reads ●tpm - Transcripts Per Million (TPM) ○for every 1,000,000 RNA molecules in the RNA-seq sample, x came from this gene/transcript ●Never ever use normalized counts for any comparisons ○...except comparing a single gene in a single experiment for the samples ○If you really, really need to use any kind of normalized counts to compare use TPM log2(fold-change) ●Fold-change is usually calculated by average expression of all samples of condition 1 vs average expression of all samples of condition 2 ●Example: a)geneA expression in pre is 5, in post is 10; fold-change of post/pre is 2 = gene is up-regulated 2x b)geneB expression in pre is 10, in post is 5; fold-change of post/pre is 0.5 = gene is down-regulated 1/2x … (O_o) ●Solution: Adding log2 gives us log2(2) = 1, log2(0.5) = -1 ●Nice and even distribution around 0 and clear interpretations log2(fold-change) ●But it might be misleading ●Large log2FC on low-expressed genes are most likely not biologically relevant ●Small log2FC on highly-expressed genes might be biologically relevant ●Example: “Common” cut-off value of fold-change of 2x (log2FC=+/-1) or 1.5x (log2FC=+/-0.58) ○geneA expression in WT is 10 and in KO is 4, log2FC = -1.32 YES (?) ○geneB expression in WT is 1,000,000 and in KO is 500,001, log2FC = -0.99 NO (?) P-value and adjusted p-value ●P-value tries to give you “a number” saying if the differences you are observing are robust and the differences are not “random” between the compared conditions/samples ●Adjusted p-value adds a correction for the multiple testing we are doing - tries to add correction of getting a p-value just by accident ●But is adjusted p-value 0.049 really better than 0.051? ●Number of replicates highly influences the estimates ○The observations might be the same but the statistical significance might be lower How many differentially expressed genes I have? It depends how many you want…:) Selection of the differentially expressed (DE) gene is completely up to you Some people use p-value, some adjusted p-value and some people log2fc and their combinations, some just take top n genes Statistical significance ≠ biological relevance!!! Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi: 10.1038/d41586-019-00857-9 https://www.nature.com/articles/d41586-019-00857-9 P-value significance 35 ●Example Differential expression output 36 www.ceitec.eu CEITEC @CEITEC_Brno Vojta Bystry vojtech.bystry@ceitec.muni.cz Thank you for your attention! >