Outline
Sequencing (NGS) in general
Sequencing data analysis in general
Kathi Zarnack and Julian König data
Results (selected)
Galaxy
RNA-Seq data analysis

Sequencing (NGS) in general
PLoS Comput Biol. 2015
Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud.
Griffith M, Walker JR, Spies NC, Ainscough BJ, Griffith OL.

How we get the data: RNA isolation, the sequencing process, the iCLIP/RNA-Seq principle (very
briefly), all the way to the .fastq files

Sequencing data analysis in general

Now we have the FASTQ files, so what do we do with them?
This is just an example, not the workflow we will use.
We need raw reads, an annotation and a reference genome.
Then we can align the reads and/or assemble the transcriptome, quantify expression or find peaks,
and further process the data.

Kathi Zarnack data
Cell. 2013
Direct competition between hnRNP C and U2AF65 protects the transcriptome from the exonization of
Alu elements.
Zarnack K, König J, Tajnik M, Martincorena I, Eustermann S, Stévant I, Reyes A, Anders S, Luscombe
NM, Ule J.

Summary of the results
There are ~650,000 Alu elements in transcribed regions of the human genome. These elements contain
cryptic splice sites, so they are in constant danger of aberrant incorporation into mature
transcripts. Despite posing a major threat to transcriptome integrity, little is known about the
molecular mechanisms preventing their inclusion. Here, we present a mechanism for protecting the
human transcriptome from the aberrant exonization of transposable elements. Quantitative iCLIP data
show that the RNA-binding protein hnRNP C competes with the splicing factor U2AF65 at many genuine
and cryptic splice sites. Loss of hnRNP C leads to formation of previously suppressed Alu exons,
which severely disrupt transcript function. Minigene experiments explain disease-associated
mutations in Alu elements that hamper hnRNP C binding. Thus, by preventing U2AF65 binding to Alu
elements, hnRNP C plays a critical role as a genome-wide sentinel protecting the transcriptome. The
findings have important implications for human evolution and disease.

Summary of the paper
Highlight the protection against cryptic exons and the competition between hnRNP C and U2AF65

Results (selected)
RNA-Seq coverage over exons
In control, cryptic exons are not expressed
In hnRNP C knockdown, cryptic exons are expressed

Highlight the coverage of Alu exons
Shown here for the PTS gene, but in the practical we will show CD55

Goal of the practical
Get from the raw sequencing data to gene expression (RNA-Seq)
Analyze RNA-Seq data to get differential gene expression and the expression of individual exons
(example: the CD55 gene)
Show the coverage of cryptic exon(s) (example: the CD55 gene)
Do everything in less than half a day

Galaxy practical
Get the data
Or you just load the preloaded data
Shared Data -> Data Libraries -> Bi5444 -> RNA-Seq

Galaxy practical
Initial quality check
Check the raw reads quality
Using FastQC tool
Input FASTQ, output HTML

Galaxy practical
Initial quality check
It is still running, right?
But without that, we cannot proceed 😞
We have a solution! :)
Import Galaxy history

Most likely the alignment will take quite some time. Guide the students to importing the shared
history in Galaxy.

Galaxy practical
Initial quality check
Import Galaxy history

Type CEITEC in the search box.

Galaxy practical
Initial quality check
HTML report(s)

We show the analysis on one file only, but you can easily repeat it for the rest of them

Galaxy practical
Initial quality check
HTML report(s)
But there are too many of them
MultiQC - makes your life simpler
This time, you can try it on your own!

We show the analysis on one file only, but you can easily repeat it for the rest of them at home.

Galaxy practical
Summary of the logs
Summarize the output logs
Using MultiQC tool
Input LOG(s), output HTML

Galaxy practical
Initial quality check - Adapter content

If you scroll all the way down, you can see some residual adapter content
But if you hover over the lines, you will see that the percentages are very low (you can also see
it on the y-axis scale)

Galaxy practical
Read preprocessing
Remove adapters and/or trim low-quality ends
Using Trimmomatic trimmer
Input FASTQ, output FASTQ
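
To make "trim low-quality ends" concrete, here is a minimal sketch that removes bases from the 3' end of a read while their Phred score is below a threshold. It only illustrates the idea and is not Trimmomatic's code; the function name, the threshold of 20 and the Phred+33 encoding are assumptions.

```python
def trim_trailing(sequence, quality, min_quality=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    # Sequence and quality string are cut at the same position so they stay in sync.
    return sequence[:end], quality[:end]


# Made-up read: the last two bases have Phred score 2 ('#').
trimmed_sequence, trimmed_quality = trim_trailing("ACGTACGT", "IIIIII##")
```

Note that the input and the output are both FASTQ records: trimming shortens reads, it does not change the file format.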

Galaxy practical
Read preprocessing
Adapter sequence (partial):
>adapter
AGATCGGAAGAGC
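
The adapter removal itself can be sketched as follows: find the adapter in the read and cut the read there, also catching an adapter that is only partially present at the 3' end. This is a simplified illustration with exact matching only (real trimmers tolerate mismatches); the function name and the minimum overlap of 5 nt are assumptions.

```python
ADAPTER = "AGATCGGAAGAGC"  # partial adapter sequence from the slide


def clip_adapter(sequence, adapter=ADAPTER, min_overlap=5):
    """Cut the read at the adapter; also handle a partial adapter at the 3' end."""
    position = sequence.find(adapter)
    if position != -1:
        return sequence[:position]
    # No full adapter found: look for the longest adapter prefix ending the read.
    longest = min(len(adapter) - 1, len(sequence))
    for length in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:length]):
            return sequence[:-length]
    return sequence


# Made-up reads for illustration.
full_hit = clip_adapter("ACGTACGT" + ADAPTER + "TTTT")
partial_hit = clip_adapter("ACGTACGT" + ADAPTER[:7])
no_hit = clip_adapter("ACGTACGT")
```

The minimum overlap is a trade-off: a small value removes more adapter remnants but also clips genuine read ends that match the adapter start by chance.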

Galaxy practical
Preprocessing quality check
Check the preprocessed reads quality and summarize
Using FastQC & MultiQC tools
Input FASTQ/LOG, output HTML

Galaxy practical
Preprocessing quality check
Please, run the FastQC and MultiQC on the preprocessed files and check the adapter content

Galaxy practical
Preprocessing quality check
Shared Data -> Histories -> Bi5444_RNA-Seq_preprocess

Galaxy practical
Preprocessing quality check
Check the preprocessed reads quality & summarize
Are all the bad things gone?

Galaxy practical
Preprocessing quality check

Yep, we are nice and clean!

Galaxy practical
Preprocessing quality check
Check the preprocessed reads quality & summarize
Are all the bad things gone?
Actually, for modern aligners such as STAR, it doesn’t matter that much
They can perform soft-clipping

Galaxy practical
Soft-clipping in alignment
Hiding of non-matching parts of the reads
Can overcome remaining adapter or low-quality sequences
Only to specified limits (in STAR the default is max. 33% of the read length)
Soft-clipped part

But allowing too much soft-clipping results in ambiguous mapping and possible cross-mapping events
For example, the default minimum seed length in BWA-MEM (a popular aligner for DNA reads) is only
19 nt, and by default alignments are reported once their score reaches 30. Such short alignments
are very unspecific, and you can easily map bacterial sequences to the human reference genome
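
Soft-clipping is recorded in the CIGAR string of each alignment as the `S` operation. The sketch below computes the soft-clipped fraction of a read from its CIGAR string and compares it with the 33% limit mentioned above. The function names and the example CIGAR strings are made up for this illustration.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
READ_CONSUMING = set("MIS=X")  # operations that consume bases of the read


def soft_clipped_fraction(cigar):
    """Return the fraction of read bases that are soft-clipped ('S')."""
    clipped = 0
    read_length = 0
    for length, operation in CIGAR_OPERATION.findall(cigar):
        if operation in READ_CONSUMING:
            read_length += int(length)
        if operation == "S":
            clipped += int(length)
    return clipped / read_length if read_length else 0.0


def within_limit(cigar, max_fraction=0.33):
    """True if the soft-clipped part does not exceed max_fraction of the read."""
    return soft_clipped_fraction(cigar) <= max_fraction


# Made-up alignments: a lightly clipped read, a spliced read, a heavily clipped read.
fractions = {cigar: soft_clipped_fraction(cigar)
             for cigar in ("10S90M", "50M1000N50M", "20S60M20S")}
```

The `N` operation in the second example is a skipped region (an intron in RNA-Seq); it does not consume read bases, so a spliced read is not penalized.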

Galaxy practical
Alignment to genome
Align RNA-Seq data to a genome
Using STAR aligner
Input FASTQ, output BAM

Galaxy practical
Alignment to genome

The name of the input file might be different.

Sit back, wait and relax

The USA is most likely to be waking up right now so the servers are very busy.

Galaxy practical
Alignment to genome
Shared Data -> Histories -> Bi5444_RNA-Seq_alignment

Galaxy practical
Alignment to genome
STAR performs well even with defaults
Main output is the BAM file
This is one of the few files worth keeping

Galaxy practical
Quality control of alignment
Run MultiQC to assess the alignment

Galaxy practical
Renaming and tags
Make dataset names easier to understand

Galaxy practical
Gene counts
For the raw gene counts (expression) you need to have a list of genes and their positions in the
genome - gene annotation
Using UCSC Main table browser
Input nothing, output GTF
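
A gene annotation in GTF format is a tab-separated table with nine columns; the last column holds attributes such as `gene_id`. The sketch below extracts gene positions from such a file. It assumes an Ensembl-style GTF that contains rows with the feature type `gene` (not every GTF source provides them); the function name and the example lines are made up.

```python
def parse_gtf_genes(lines):
    """Return {gene_id: (chromosome, start, end, strand)} for 'gene' rows."""
    genes = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a valid GTF row
        chromosome, _, feature, start, end, _, strand, _, attributes = fields[:9]
        if feature != "gene":
            continue
        gene_id = None
        for attribute in attributes.split(";"):
            attribute = attribute.strip()
            if attribute.startswith("gene_id "):
                gene_id = attribute.split(" ", 1)[1].strip('"')
        if gene_id:
            genes[gene_id] = (chromosome, int(start), int(end), strand)
    return genes


# Made-up annotation lines, for illustration only.
EXAMPLE_GTF = [
    "#!genome-build example",
    '1\texample\tgene\t100\t500\t.\t+\t.\tgene_id "GENE_A"; gene_name "A";',
    '1\texample\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; exon_number "1";',
]
example_genes = parse_gtf_genes(EXAMPLE_GTF)
```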

Galaxy practical
Gene counts
Shared Data -> Data Libraries -> Bi5444
-> RNA-Seq -> Ensemble_Homo_sapiens.GRCh38.94.gtf.gz

Galaxy practical
Gene counts
Get the raw gene counts
Using featureCounts tool
Input BAM and annotation GTF, output TXT (raw gene counts)
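
The idea behind raw gene counts can be sketched as counting, for each gene, the aligned reads that overlap it. The toy version below works on whole gene intervals and skips reads that overlap more than one gene. It is a strong simplification and not the featureCounts implementation (which, among other things, works on exons and takes strand into account); all names and coordinates are made up.

```python
def count_reads_per_gene(read_positions, genes):
    """Count reads overlapping exactly one gene.

    read_positions: list of (chromosome, start, end) of aligned reads
    genes: {gene_id: (chromosome, start, end)}
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chromosome, start, end in read_positions:
        hits = [
            gene_id
            for gene_id, (gene_chromosome, gene_start, gene_end) in genes.items()
            if gene_chromosome == chromosome and start <= gene_end and end >= gene_start
        ]
        if len(hits) == 1:  # ambiguous and unassigned reads are not counted
            counts[hits[0]] += 1
    return counts


# Made-up genes (overlapping between positions 450 and 500) and reads.
EXAMPLE_GENES = {"GENE_A": ("1", 100, 500), "GENE_B": ("1", 450, 900)}
EXAMPLE_READS = [("1", 120, 170), ("1", 460, 480), ("1", 700, 750), ("2", 120, 170)]
example_counts = count_reads_per_gene(EXAMPLE_READS, EXAMPLE_GENES)
```

Of the four example reads, one falls into the region shared by both genes and one lies on a chromosome without annotation, so only two reads end up in the count table.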

Galaxy practical
Gene counts
Quality control of gene counts
Again MultiQC

Galaxy practical
Differential gene expression
Get differential gene expression from the raw counts
Using edgeR and DESeq2 tools
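
To give an intuition for what "differential expression from raw counts" means, the sketch below normalizes raw counts to counts per million (CPM) and computes a log2 fold change between two samples. This is deliberately naive and is not what edgeR or DESeq2 do: there is no dispersion estimate, no use of replicates and no statistical test. The function names, the pseudocount and the counts are assumptions.

```python
import math


def counts_per_million(counts):
    """Scale raw counts by library size to counts per million (CPM)."""
    total = sum(counts.values())
    return {gene: 1e6 * value / total for gene, value in counts.items()}


def log2_fold_change(control, knockdown, pseudocount=1.0):
    """Naive log2(knockdown / control) on CPM values; pseudocount avoids log(0)."""
    control_cpm = counts_per_million(control)
    knockdown_cpm = counts_per_million(knockdown)
    return {
        gene: math.log2((knockdown_cpm[gene] + pseudocount)
                        / (control_cpm[gene] + pseudocount))
        for gene in control
    }


# Made-up raw counts for two samples with the same library size.
CONTROL = {"GENE_A": 100, "GENE_B": 900}
KNOCKDOWN = {"GENE_A": 400, "GENE_B": 600}
example_lfc = log2_fold_change(CONTROL, KNOCKDOWN)
```

A positive value means higher expression in the knockdown, a negative value lower expression. The dedicated tools add the statistics needed to decide whether such a difference is larger than the variability between replicates.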

Galaxy practical
Differential gene expression - note
Optimally, the experiment should be designed with at least three biological replicates
However, if the data are only “supportive”, two replicates can be enough

Galaxy practical
Differential gene expression
Shared Data -> Histories -> Bi5444_RNA-Seq_DE_start

Galaxy practical
Differential gene expression

We can see that the samples are nicely clustering together

Galaxy practical
Gene symbol annotation
But we do not see any gene symbols/names, which we all like to have
Merge with HUGO information (https://www.genenames.org/)
Shared Data -> Data Libraries -> Bi5444
-> RNA-Seq -> HUGO Gene information

Galaxy practical
Gene symbol annotation
Merge with HUGO information
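
The merge is a join of two tables on the gene identifier: every row of the differential expression results gets the gene symbol from the lookup table, and rows without a match are kept. A minimal sketch, with made-up identifiers, column names and values:

```python
def annotate_with_symbols(results, id_to_symbol, missing="NA"):
    """Add a 'symbol' column to each result row; unmatched genes get `missing`."""
    return [
        dict(row, symbol=id_to_symbol.get(row["gene_id"], missing))
        for row in results
    ]


# Made-up differential expression results and a made-up ID-to-symbol table.
EXAMPLE_RESULTS = [
    {"gene_id": "GENE_ID_1", "log2FC": 1.8},
    {"gene_id": "GENE_ID_2", "log2FC": -0.4},
]
EXAMPLE_SYMBOLS = {"GENE_ID_1": "SYMBOL_1"}

annotated = annotate_with_symbols(EXAMPLE_RESULTS, EXAMPLE_SYMBOLS)
```

Keeping unmatched rows (a left join) is a deliberate choice here: genes without a symbol should not silently disappear from the results.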

Galaxy practical
Alignment coverage
Visualization of coverage of aligned data (and expressed exons)
Using deepTools -> bamCoverage
Input BAM, output BIGWIG

Galaxy practical
Alignment coverage
Effective genome size (hg38): 2913022398
https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html
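
The effective genome size is needed when the coverage is normalized to 1x: the expected average coverage is (mapped reads x read length) / effective genome size, and the coverage is scaled by the inverse of that value. The sketch below shows only this arithmetic, under the assumption that this is how the 1x normalization is defined; the function names and the library size are made up.

```python
EFFECTIVE_GENOME_SIZE_HG38 = 2913022398  # value quoted on the slide


def average_coverage(mapped_reads, read_length,
                     genome_size=EFFECTIVE_GENOME_SIZE_HG38):
    """Expected mean coverage if reads were spread evenly over the genome."""
    return mapped_reads * read_length / genome_size


def one_x_scale_factor(mapped_reads, read_length,
                       genome_size=EFFECTIVE_GENOME_SIZE_HG38):
    """Factor that rescales the observed coverage to an average of 1x."""
    return 1.0 / average_coverage(mapped_reads, read_length, genome_size)


# Made-up library: 30 million mapped reads of 50 nt each.
example_coverage = average_coverage(30_000_000, 50)
example_scale = one_x_scale_factor(30_000_000, 50)
```

With this scaling, samples of different sequencing depth become directly comparable in the genome browser.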

Note: The tool versions have changed since 2017, so the interface might look slightly different.

Galaxy practical
Alignment coverage
BIGWIG coverage

Galaxy practical
Alignment coverage
BIGWIG visualization

Galaxy practical
Alignment coverage
CD55 region

Galaxy practical
Alignment coverage
BIGWIG size

Galaxy practical
Alignment coverage
Now you do it for the other two BIGWIG files

There is an antisense Alu present at “exon10”
You can actually see one more antisense Alu, which is also slightly expressed after hnRNP C
knockdown
And if you look further, both of these exons are annotated in RefSeq as well

Galaxy practical
Alignment coverage
If something went wrong, import the history with the DE analysis and coverage visualization
Shared Data -> Histories -> Bi5444_RNA-Seq_DE_full_history

RNA-Seq data analysis - pipeline in Galaxy
1. Initial quality check - FastQC
○ Check for overall quality of the data, number of reads, read length distribution, ...
2. Preprocessing - Trimmomatic
○ Remove adapters, low-quality ends, unwanted sequences, ...
3. Alignment - STAR
○ Map reads to the reference genome
4. Alignment quality check - STAR log, featureCounts
○ Check overall alignment statistics
5. Genome coverage (peaks) - bamCoverage
○ Get overview of mapped positions in the genome
6. Gene annotation - UCSC Main table browser
○ Get gene annotations for reference genome
7. Quantification - featureCounts
○ Get gene read counts
8. Differential gene expression - edgeR, DESeq2 (genes)
○ Differences between conditions

RNA-Seq data analysis - other possibilities
1. Initial quality check - FastQC
○ Check for overall quality of the data, number of reads, read length distribution, ...
2. Preprocessing - Cutadapt, BBTools, seqtk
○ Remove adapters, low-quality ends, unwanted sequences, ...
3. Alignment - GSNAP, Bowtie2
○ Map reads to the reference genome
4. Alignment quality check - Picard tools, RSeQC, Qualimap
○ Check overall alignment statistics
5. Genome coverage (peaks) - STAR + bedGraphToBigWig, Bedtools
○ Get overview of mapped positions in the genome
6. Gene annotation - UCSC, Ensembl, NCBI
○ Get gene annotations for reference genome
7. Quantification - RSEM, HTSeq, Salmon, Kallisto
○ Get gene read counts
8. Differential gene expression - DEXSeq (exons), baySeq (genes)
○ Differences between conditions
9. Gene ontology and pathways - g:Profiler, KEGG
○ Check ontologies and pathways for selected genes