Lecture 5 : RNA-seq analysis
Modern Genomic Technologies  (LF:DSMGT01 )
Vojta Bystry
vojtech.bystry@ceitec.muni.cz

NGS data analysis
2
2
Raw data
.fastq
Genome/Transcriptome Reference Mapping
.bam
Interaction
analysis
CHIP-seq
Expression analysis
RNAseq
Variant analysis
WES
de-multiplexing
Not known
reference
QC
QC
>
Experiment design
Not ”classic” reference
Metagenomics
Reference assembly
Immunogenetic
VDJ-genes
CRISPR
sgRNA
Methylation
Bisulfide-seq
…

The RNA-Seq workflow


Alignment
●Mapping to genome or transcriptome?
●Genome
○Requires spliced alignment
○Can find novel genes/isoforms/exons
○Information about whole genome/transcriptome
●Transcriptome
○No spliced alignments necessary
○Many reads will map to multiple transcripts (shared exons)
○Cannot find anything new
○Difficult to determine origin of reads (multiple copies of transcripts)

Alignment
●Our choice is the STAR aligner
●It performs genome alignment
●Offers a lot of settings to support splicing, soft-clipping, chimeric alignments, ...
●Other techniques (Salmon or Kallisto) do not use alignment per se and can give you the gene count
information right away
○They use only transcriptome as a reference and are very quick
○Drawback is you see only what’s in the transcriptome and nothing else

Duplication removal - UMI
6
•PCR duplicates
•Optical duplicates
•
•How the tools recognize duplicates
‒Maps to the exact same place
•Problem is it could be identical fragment not PCR duplicate
•UMI helps
‒Maps to the exact same place
‒AND have identical UMI sequence
•

Post-alignment QC
●Post-alignment QC gives us information about the mapping
○Number of mapped reads - unique + multi mapped
○Mapped locations
○Duplication rates
○Library strand specificity
○Captured biotypes
○rRNA contamination
○5’ to 3’ end coverage bias
○…

Post-alignment QC - Tools
●STAR alignment results - number of mappings and others
●RSeQC - mapped locations (Read Distribution), library strand specificity
●featureCounts biotypes - summary of mappings to gene biotypes
●FastQ screen (not exactly Post QC) - residual content of rRNA, tRNA, general mapping percentage to
the genome (if selected)
●Qualimap - general alignment statistics focused on RNA-Seq (rnaseq) including gene body coverage

Post-alignment QC - RSeQC
●RSeQC is a general tool for many QC results
●Few of them are
○Read distribution - calculates assignment of reads to different genomic features
○Infer experiment - test strand specificity of the library
○Inner distance - calculates approximate distance between read pairs

Post-alignment QC - FastQ_screen
●FastQ screen is a quick scan of potential mapping locations on different references
●We can use it to do a quick scan of contaminations (various organisms) as well as estimate
residual rRNA content
○In polyA selection based libraries we expect to have less then 2% rRNA content
○In rRNA libraries we can have up to 10-15% of rRNA and still consider it a good library
●Biobloom other option more computationally expensive

Post-alignment QC - Qualimap
●Qualimap performs a numerous checks of the alignment
○One of the modules is rnaseq which is focused directly on RNA-Seq alignments
●One of the main information we can get from this module is the gene body coverage
○We would like to see a nice and even read mapping coverage along the whole length of the genes
○The coverage, however, depends on the library fragmentation (low RIN, FFPE samples but also
depends on the used library kit (Lexogen QuantSeq)!

Note: Gene body coverage
●Often, libraries with high fragmentation (and low RIN numbers) combined with polyA selection might
have strong 3’ end bias
○This is a result of polyA “pulled” fragments
●Some kits, however, target only the polyA tail or sequences close to it
○An example is Lexogen QuantSeq which sequences only one read per mRNA molecule close to polyA tail

Note: Gene body coverage II
Examples
Source: Sigurgeirsson et al. PLoS ONE 2014

Mapping QC
14


15
●Examples
Mapping QC

Feature counting
●Now, when we know our alignments are solid we need to get the number of reads mapped to a gene (or
other feature)
○From there, we can calculate the differential expression
●The question is, how do we summarize the counts
○Do we want only uniquely mapped reads
○Do we want also multi mapped? And how do we assign them? All? One random? Somehow else?
○And what if we have multiple genes which overlap each other?

Strand specific library
●We can basically have three strand specificities
○Non stranded/Unstranded - not very common anymore
■Direction of the read mapping is completely random (50/50)
○Forward (sense) stranded - common for target kits and “bacterial kits”
■Direction of the read mapping is the same as the gene it originates from
○Reverse (antisense) stranded - “default” for Illumina and NEB kits
■Direction of the read mapping is the opposite as the gene it originates from
●In case of paired-end sequencing it’s measure by the first (R1) read orientation (FR, RF)

Read to gene assignment
Figure 2 | Key criteria for evaluation of strand-specific RNA-seq libraries. (a–d) Categories of
quality assessment were complexity (a), strand specificity (b), evenness of coverage (c) and
comparison to known transcript structure (d). Double-stranded genome with gene ORF orientation
(blue arrows) and UTRs (blue lines) are shown along with mapped reads (black and red arrows, reads
mapped to sense and antisense strands, respectively).
https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/rb-rnaseq/tutori
al.html

Feature counting
●The regular settings are - summarize reads mapping to exons (-t exon) and sum them up to gene id
(-g gene_id)
●
●Other possibilities:
○Count per exons
○Include introns
○…

Gene counts - Tools
●featureCounts is build around the “classic” read to gene assignment
○By default, assigns only uniquely mapped reads an only reads uniquely assignable to a single gene
(but both can be changed)
○Gives you raw read counts per gene
●RSEM is efficient in counting also multi mapped reads and can estimate expression of individual
gene isoforms
○Tries to “weight” the probability a mapped position of a multi mapped read and assign it correctly
to the real source
○Gives you estimated counts per gene as well as per isoform and normalized TPM = Transcripts per
million transcripts
●
●But, there is a big differences in the minimal required “good” aligned reads

Minimal number of reads and expression I
●RSEM is less precise in low read counts (<40-50M reads) and for low expressed RNAs (difficult to
estimate)
●For lower read counts it’s safer to go for featureCounts
●Our best practices for a minimal read count for each tools:
○Less than 40-50M aligned reads (to the good stuff) -> featureCounts
○More than 40-50M aligned reads (to the good stuff) -> RSEM
●But if you want isoforms!!! -> RSEM
Source: Roberts et al. Nature Methods 2013

RSEM, Salmon and Kallisto give slightly better estimates of the “real” expression because they
don’t discard any mapped read. Because even if the read is multi mapped it’s still real.

22
Feature count results


Differential expression
●We have our raw read counts but we need to find the real differences
●We want to figure out the change comparing the before and after treatment
●What are the changed genes? Are there even any? Is there even difference between the samples? And
what about the experimental design - paired samples - does it affect the evaluation?
●The tools for the differential expression have to account for different libraries depths, model
and “fix” outliers, account for different levels of expressions, and many other things
●Luckily, there are few tools that have all of this and can be used

Differential expression - tools
●DESeq2
○More specific
●edgeR
○More sensitive
●The important part of the calculation is the design
○Assignment of a group/condition to a sample
○If the samples are paired (the same patient twice) we have to account for this as well!
○Technically, the pairing of the samples is a batch effect so it is similar to have a technical
noise in your data

Pairing of the samples/batch effect
●Paired samples are not the same as paired-end sequencing!


●There is a bad experimental design and a good experimental design
●Very simply - more randomization gives you better results
Pairing of the samples/batch effect

Supplemental Figure 1: The problem of confounding biological variation and batch effects. The top
section depicts a completely confounded study design of processing individual cells from three
biological groups
(represented by shapes) in three separate batches (represented by colors). In this case, we cannot
determine if
biology or batch effects drive the observed variation. The bottom section depicts a balanced study
design consisting
of multiple replicates (rep) split and processed across multiple batches. The use of multiple
replicates allows observed
variation be attributed to biology (cells cluster by shape) or batch effects (cells cluster by
color).

●And example pairing of the patients AND different sequencing years - double batch
Pairing of the samples/batch effect

1st figure - no batch effect/pairing of the samples included - we can see that individual
sequencing years are separated (left and rigth) and also that individual patient samples are
clustered together (even and odd numbers). We don’t see clustering of CXCR4bright and CXCRdim
samples together:(
2nd figure - batch effect/pairing of the samples was included in the calculation and voila -
samples are nicely clustered according to the condition and not according to the patient pairing
and/or year of the sequencing.
Without accounting for the batch effect/pairing of the samples we wouldn’t get any reasonable
results and what even more - we might get completely wrong results!

28
Differential expression results


29
Count normalisation
●Normalize to:
○Gene size
○Library size
●rpkm - Reads Per Kilobase of transcript per Million mapped reads
●fpkm - Fragments Per Kilobase of transcript per Million mapped reads
●tpm - Transcripts Per Million (TPM)
○for every 1,000,000 RNA molecules in the RNA-seq sample, x came from this gene/transcript
●Never ever use normalized counts for any comparisons
○...except comparing a single gene in a single experiment for the samples
○If you really, really need to use any kind of normalized counts to compare use TPM

log2(fold-change)
●Fold-change is usually calculated by average expression of all samples of condition 1 vs average
expression of all samples of condition 2
●Example:
a)geneA expression in pre is 5, in post is 10; fold-change of post/pre is 2 = gene is up-regulated
2x
b)geneB expression in pre is 10, in post is 5; fold-change of post/pre is 0.5 = gene is
down-regulated 1/2x … (O_o)
●Solution: Adding log2 gives us log2(2) = 1, log2(0.5) = -1
●Nice and even distribution around 0 and clear interpretations

log2(fold-change)
●But it might be misleading
●Large log2FC on low-expressed genes are most likely not biologically relevant
●Small log2FC on highly-expressed genes might be biologically relevant
●Example: “Common” cut-off value of fold-change of 2x (log2FC=+/-1) or 1.5x (log2FC=+/-0.58)
○geneA expression in WT is 10 and in KO is 4, log2FC = -1.32 YES (?)
○geneB expression in WT is 1,000,000 and in KO is 500,001, log2FC = -0.99 NO (?)

P-value and adjusted p-value
●P-value tries to give you “a number” saying if the differences you are observing are robust and
the differences are not “random” between the compared conditions/samples
●Adjusted p-value adds a correction for the multiple testing we are doing - tries to add correction
of getting a p-value just by accident
●But is adjusted p-value 0.049 really better than 0.051?
●Number of replicates highly influences the estimates
○The observations might be the same but the statistical significance might be lower

How many differentially expressed genes I have?
It depends how many you want…:)
Selection of the differentially expressed (DE) gene is completely up to you
Some people use p-value, some adjusted p-value and some people log2fc and their combinations, some
just take top n genes
Statistical significance ≠ biological relevance!!!
Scientists rise up against statistical significance, Nature 567, 305-307 (2019), doi:
10.1038/d41586-019-00857-9

https://www.nature.com/articles/d41586-019-00857-9

P-value significance


35
●Example
Differential expression output

36
www.ceitec.eu
CEITEC
@CEITEC_Brno
Vojta Bystry
vojtech.bystry@ceitec.muni.cz
Thank you for your attention!

>