1
Metagenomics data analysis
Mgr. Eva Budinská, Ph.D.
Ing. Vojtěch Bartoň
eva.budinska@recetox.muni.cz, vojtech.barton@recetox.muni.cz
IV110 Projekt z bioinformatiky I
IV114 Projekt z bioinformatiky a systémové biologie
E4014 Projekt z Matematické biologie a biomedicíny - biomedicínská
bioinformatika
Metagenomics is the study of genetic material
recovered directly from environmental samples. permafrost
soil
water
stool
indoordust
skin
Metagenomics hence studies microbiome
Microbiome
• is a community of microorganismsthat can usually be
found living together in a specific environment
• Microorganism- a single-celled organism that can only
be seen under a microscope
• Bacteria, Viruses, Fungi, Yeast, Algae, ...
Dysbiosis when
something
goes wrong
• Microbiome out of balance
• Associated with many diseases,
including cancer
Specifics of
metagenomic data
analysis
• Metagenomics studies sample DNA from the
whole community
• A metagenomic sample often contains reads
from a huge number of (micro)organisms of
which many are unknown
• The sequences are often incompleteand hard
to assemble to individual genes or recover full
genomes of each organism
Two main approaches
Sequencing specific target genes (16S rRNA, 18S
rRNA, ITS, rpoB…)
Result: A quick estimate of taxonomic diversity and
composition.
Marker-gene metagenomics
(targeted sequencing)
Two main approaches
Sequence all genomic fragments from the sample.
Result: Insights into composition and function of the
microbiome of the sample.
Sequencing specific target genes (16S rRNA,
18S rRNA, ITS, rpoB…)
Result: A quick estimate of taxonomic
diversity and composition.
Shotgun metagenomics
(whole-genome sequencing)
Marker-gene metagenomics
(targeted sequencing)
Marker-gene metagenomics
(targeted sequencing)
• Aim: obtain taxonomicalrepresentation of microbiome in the
sample using specific target genes
Marker-gene (targeted)
metagenomics
For bacteria: gene for 16s rRNA and its
selected variableregions
Marker-gene metagenomics
(targeted sequencing) - basic workflow
.fastq
The usual data analysis pipeline
1. Preprocessing:Fastq files
preprocessing, deconvolution, QC,
trimming, joining reads
2. Identification of sequences
representative of potential species
3. Taxonomy assignment
4. Exploratory analysis
• Analysis of diversity measures and their
visualization
5. Inference analysis
• Associating composition with variables of
interest
.fastq
Raw
sequences
1. Deconvolution,
QC, filtering,
trimming,
PE read joiningFiltered
sequences
.fastq
.fastq
.fastq
OTUs /
ASVs
3. Taxonomy
assignment
(species
identification)
Exploring and
inference
Alpha diversity
2. Binning
The usual data analysis pipeline
1. Preprocessing:Fastq files
preprocessing, deconvolution, QC,
trimming, joining reads
2. Identification of sequences
representative of potential species
3. Taxonomy assignment
4. Exploratory analysis
• Analysis of diversity measures and their
visualization
5. Inference analysis
• Associating composition with variables of
interest
.fastq
Raw
sequences
1. Deconvolution,
QC, filtering,
trimming,
PE read joiningFiltered
sequences
.fastq
.fastq
.fastq
OTUs /
ASVs
3. Taxonomy
assignment
(species
identification)
Exploring and
inference
Alpha diversity
2. Binning
Marker-gene metagenomics
Step 1. Preprocessing and QC
• Marker gene metagenome is small => many samples are
combined within a single run
• Not uncommon to have all the reads in one fastq file
• Samples need to be barcoded and demultiplexed
• Followed by standard QC, trimming, joining PE reads
The usual data analysis pipeline
1. Preprocessing:Fastq files
preprocessing, deconvolution, QC,
trimming, joining reads
2. Identification of sequences
representative of potential species
3. Taxonomy assignment
4. Exploratory analysis
• Analysis of diversity measures and their
visualization
5. Inference analysis
• Associating composition with variables of
interest
.fastq
Raw
sequences
1. Deconvolution,
QC, filtering,
trimming,
PE read joiningFiltered
sequences
.fastq
.fastq
.fastq
OTUs /
ASVs
3. Taxonomy
assignment
(species
identification)
Exploring and
inference
Alpha diversity
2. Binning
Marker-gene metagenomics
Step 2. Identification of sequences
representative of potential species
• Aim: Organize reads according to their individual/organism
of origin
• Theory: one read ~ one gene ~ one individual ~ one species
• Reality:
– one species can have multiple and differentcopies of a gene
– one sequence can be shared by multiple species
– problem to distinguishsequencing error from a real change
between species
AACCGTCGACGGTCAT
TTGCCATGACGATATA
TTGCCATGACGATATA
AACCGTCGACGGTCAT
AACCGTCGACGGTCAT
Marker-gene metagenomics
Step 2. Identification of sequences
representative of potential species
• Solution: Clustering/binning of similar sequences into an
OTU – operational taxonomic unit
• Similarity: 97% or less or more...
• "OTU picking"
• Final representativesequence of an OTU is a consensus
sequence (average, …)
Marker-gene metagenomics
Clustering de novo
• Clustering based on similarity
of sequencing without taking into
account reference database
Marker-gene metagenomics
Clustering de novo
•Clustering based on similarity
of sequencing without taking into
account reference database
•Disadvantages:
•if new samples added, we need
to recluster (reanalyze)​
Marker-gene metagenomics
Clustering de novo
•Clustering based on similarity
of sequencing without taking into
account reference database
•Disadvantages:
•if new samples added, we need
to recluster (reanalyze)​, this means,
clustering can change
Marker-gene metagenomics
Clustering de novo
•Clustering based on similarity
of sequencing without taking into
account reference database
•Disadvantages:
•if new samples added, we need
to recluster (reanalyze)​, this means,
clustering can change
•computationally expensive
•only way if reference is unknown
Marker-gene metagenomics
Clustering - closed reference
•Using reference databases, we
cluster around known sequences
Marker-gene metagenomics
Clustering - closed reference
•Using reference databases, we
cluster around known sequences
•Disadvantages:
•only for well characterized
types of samples (stool?)
•discarding unknown
Marker-gene metagenomics
Clustering – open reference
•Using reference databases, we
cluster around known sequences
•Those that did not cluster with
reference are clustered de novo
Marker-gene metagenomics
Clustering – open/close reference
•PROBLEM: REFERENCE BIAS
well studied microbes have
hundreds to thousands sequences
as opposed to few or none of the
less studied ones
From Sun Y1, Cai Y, HuseSM, Knight R, FarmerieWG, Wang X, Mai V. (2011) A large-scale benchmark study of existing algorithms for taxonomy-independent
microbial community analysis.Brief Bioinform. 2012 Jan;13(1):107-21
Marker-gene metagenomics
Bunch of problems...
Substitutions
ACTGCTAGC
ACTGATAGC
• PCR
• sequencing
How can we tell errors from real changes?
Chimeras
ACTGTAGC
ACTGGGCT
AGACGGCT
• PCR
Marker-gene metagenomics
Correcting errors
• Typical: QC, trimming, Nfiltering,
length-filtering,
adapter removal
• OTU picking per se helps
correcting some errors, but
the problem persists
Marker-gene metagenomics
OTU picking
Marker-gene metagenomics
Let's be smarter - build an error model
We can calculate posterior
probability of a sequence being
real vs artefact, based on our
knowledge of sequence errors and
their frequencies.
We can test whether our observed
frequency is larger than expected
frequency and get a p-value.
If we reject the hypothesis, we
keep that sequence!
Marker-gene metagenomics
Applying error model
•I discard improbable sequences
Marker-gene metagenomics
Applying error model
•I discard improbable sequences
Marker-gene metagenomics
Applying error model
•I discard improbable sequences
•Each sequence now represents a
taxonomical unit
Marker-gene metagenomics
Applying error model – getting ASVs
•I discard improbable sequences
•Each sequence now represents a
taxonomical unit called
Amplicon Sequence Variant – ASV
• Only 1 sequence represents
the bacteria
Also known as ESV (exact sequence variant) or zOTU (zeroradius
OUT)​
Marker-gene metagenomics
Applying error model – getting ASVs
•Already existing ASVs are not
changed if new samples are added
Marker-gene metagenomics
Applying error model – getting ASVs
•However! Adding new samples
can result in previously discarded
sequences becoming
more frequent and become ASVs!
•Already existing ASVs are not
changed if new samples are added
Marker-gene metagenomics
Chimera removal
The usual data analysis pipeline
1. Preprocessing:Fastq files
preprocessing, deconvolution, QC,
trimming, joining reads
2. Identification of sequences
representative of potential species
3. Taxonomy assignment
4. Exploratory analysis
• Analysis of diversity measures and their
visualization
5. Inference analysis
• Associating composition with variables of
interest
.fastq
Raw
sequences
1. Deconvolution,
QC, filtering,
trimming,
PE read joiningFiltered
sequences
.fastq
.fastq
.fastq
OTUs /
ASVs
3. Taxonomy
assignment
(species
identification)
Exploring and
inference
Alpha diversity
2. Binning
Marker-gene metagenomics
Step 3. Taxonomy assignment
Marker-gene metagenomics
Step 3. Taxonomy assignment
•A relatively easy task
•Just alignd ASVs to the reference
db!
• BLAST, HMMER or USEARCH
•But – how to deal with multiple
results?
Marker-gene metagenomics
LCA – Least common ancestor
Marker-gene metagenomics
LCA – Least common ancestor
Marker-gene metagenomics
LCA – Least common ancestor
Ok, all agree
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES
Ok, all agree
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES
Marker-gene metagenomics
LCA – Least common ancestor
Ok, most agree
RECETOXBACTERIALES
Marker-gene metagenomics
LCA – Least common ancestor
Ok, most agree
RECETOXBACTERIALES RECETOXBACTERIA
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES RECETOXBACTERIA
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES RECETOXBACTERIA
Marker-gene metagenomics
LCA – Least common ancestor
???
RECETOXBACTERIALES RECETOXBACTERIA
Marker-gene metagenomics
LCA – Least common ancestor
???
RECETOXBACTERIALES RECETOXBACTERIA UNASSIGNED
Marker-gene metagenomics
LCA – Least common ancestor
RECETOXBACTERIALES RECETOXBACTERIA UNASSIGNED
Marker-gene metagenomics
Database bias
Marker-gene metagenomics
Reference databases for 16SrRNA
Beware of regularly updated versions of the dbs
Marker-gene metagenomics
Taxonomy classifiers for 16S rRNA gene sequences
The usual data analysis pipeline
1. Preprocessing:Fastq files
preprocessing, deconvolution, QC,
trimming, joining reads
2. Identification of sequences
representative of potential species
3. Taxonomy assignment
4. Exploratory analysis
• Analysis of diversity measures and their
visualization
5. Inference analysis
• Associating composition with variables of
interest
.fastq
Raw
sequences
1. Deconvolution,
QC, filtering,
trimming,
PE read joiningFiltered
sequences
.fastq
.fastq
.fastq
OTUs /
ASVs
3. Taxonomy
assignment
(species
identification)
Exploring and
inference
Alpha diversity
2. Binning
Alpha diversity (within-sample diversity)
• Used in ecology, a measure of how diverse a single sample is
1. Shannon index 2. Simpson index
3. Number of ASVs
Alpha diversity (within-sample diversity)
3. Number of OTUs/ASVs 4. Chao 1
A= N + S2 / (2 D)
• N is the number of OTUs/ASVs
• S is the number of singleton
OTUs/ASVs
• D is the number of doublet OTUs,
i.e. OTUs with abundance2.
PCA – principal component analysis
Based on the
normalized c
ompositional
profiles
Clustering
Based on the
normalized c
ompositional
profiles
Group comparison – comparing differences
between groups
Applying statistical testing on each
bacteria to determine difference in
their abundance
Statistical considerations
Compositional nature of the metagenomic data
• The microbiome abundancies
(read counts)are constrained by
the maximum number of DNA
reads that the sequencer can
provide (the total count constraint)
• Hence the data representsin fact a
proportion (composition) of taxa!
3000
500
14000
4000
200
24000
5000
100
300
SAMPLE 1 SAMPLE 2 SAMPLE 3
Firmicutes Bacteriodetes Alphaproteobacteria
The data is compositional – so what?
• The compositional nature of the data induces strong dependencies among the
abundances of the different taxa:
• an increase in the abundance of one taxon implies the decrease of the observed number
of counts – hence proportions - for other taxaand vice versa
3000
500
14000
4000
200
24000
5000
100
300
SAMPLE 1 SAMPLE 2 SAMPLE 3
Firmicutes Bacteriodetes Alphaproteobacteria
3000
500
14000
4000
24000
100
300
SAMPLE 1 SAMPL E 2 SAMPLE 3
Firmicutes Bacteriodetes Alphaproteobacteria
Before treatment After treatment
The data is compositional – so what?
In a composition the value of each component is not informative by itself and
the relevant information is contained in the ratios between the components
The most known: Firmicutes / Bacteroides ratio
Firmicutes/Bacteroides ratio?
Effect of DNA isolation kit on Gram+vs Gram- bacteria
Videnskaet al. (2019)Stool sampling and DNA isolation kits affectDNA
quality and bacterial compositionfollowing 16S rRNA gene sequencing
using MiSeq Illumina platform,Scientific Reports
Effect of sampling kit on G+
and G- bacteria
Videnskaet al. (2019)Stool sampling and DNA isolation kits affectDNA quality
and bacterial compositionfollowing 16S rRNA gene sequencing using MiSeq
Illumina platform, Scientific Reports
The data is compositional– so what? / part 2
• Compositional data do not exist in the Euclidean space, but in a special
constraint space called the simplex
• Hence it is incorrect to apply any multivariable techniques that are dependent
on this space without proper data transformation (e.g. PCA, clustering….)
BacteroidesFirmicutes
Actinobacteria
100%
0%
PCA on compositional data (without proper
transformation)
individuals
Statistical methods for analysis of compositional
data need to fulfill these criteria:
1. Scale Invariance (e.g. the result should be the same regardless of the scale of
the measurement)
• Example: how similar are these two samples?
% Absolute read counts
A B A B
Fusobacteria 10 11 700 11000
Proteobacteria 15 14 1050 14000
Bacterioides 25 20 1750 20000
Firmicutes 50 55 3500 55000
Euclidean
distance
7.2 57088.6
Statistical methods for analysis of compositional
data need to fulfill thesecriteria:
2. Subcompositional coherence (e.g. the analyses should lead to the same conclusions
regardless of which components of the data are included)
This is especially a problem for correlations between taxa, which tend to be more negative
when we remove some taxa and recalculate the proportions.
Statistical methods for analysis of compositional
data need to fulfill these criteria:
2. Subcompositional coherence (e.g. the analyses should lead to the same
conclusions regardless of which components ofthe data are included)
Alternative(s) to correlation:
1. phi (Φ) = var(Ai -Aj)/var(Ai)
2. rho (⍴) = var(Ai -Aj)/(var(Ai) + var(Aj))
3. phis (Φs) = var(Ai -Aj)/var(Ai +Aj)
Ai is the log-transformed values for a metagenomic component ‘i’ in the data
Aitchison, 1982, J.R.Statist. Soc.
Lovell et al., 2015, PLoS Comp Biol
Quinn et al, 2017, Scientific Reports 7
Data transformation (normalization)
• Compositional data can be normalized in order to make them suitable for existing statistical
techniques
• Aitchinson, 1982 - build a theory and concepts of analysis of compositional data and suggested
normalizations
• Basic concept – make log-ratios between components
ALR (additive log-ratio transformation)
+ good for most statistical techniques
– needs careful selection of one
component, we are working with k-1
taxa, more difficult to interpret
CLR (centered log-ratio transformation)
+ ratio to geometric mean, preserves all
taxa, no need to select one
– creates singular covariancematrix
ILR (isometric log-ratio
transformation)[Egozcque, 2003]
PhILR (phylogenetic partitioning
based ILR transform)[Silverman
et al, 2017]
Compositional data - PCA before and after
normalization
PCA on absolute counts–
main variance lies in the
sequencing depth
Relativedata – problem
with simplex, colour by
individual
CLR transformed data–
colour by individual
The excess zero problem
• Log-ratio transformations require data with
positive values, any statistical analysis of count
compositions must be preceded by a proper
replacement of the zeros
• What to do?
• We need to fill in the zeroes....
• E.g. Bayesian multiplicative treatment of count
zeros [Martín-Fernandez,2014,StatisticalModelling]
We do not know whether the
zeros are real or just below th
reshold
The 16SrRNAgene copy number problem
• Different taxa have different number of copies of the 16S rRNA gene -
this
• Some algorithms exist that try to normalize this count, but for many
taxa this is unknown, hence estimation takes place.
• Should we normalize for count or not?
Take home messages
• Metagenomic data are compositional and it is not optional!
• Compositional data do not exist in Euclidean space
• Transformations of compositional data and specific statistical
approaches are needed for data analysis
• It is not easy to make correlations between the taxa
• Normalization to 16SrRNA gene copy number change is necessary,
but still quite tricky and the benefits are not clear
Resources to study
• OUT vs ASV explained VIDEO
• Dan Knight: microbiome discovery – series of videos
(however, uses QIIME and OTUs, which is outdated now, still very worth it!)
• PathoScope 2.0: a complete computationalframework for strain identification in
environmental or clinical sequencing samples | Microbiome | Full Text
(biomedcentral.com)
Metagenomic pipelines
Pipelines
• DADA2
• QIIME2
• Mothur
• PathoScope 2.0
• Kraken
Databases
• SILVA 138
• GreenGenes 13_8
• RefSeq2020
• Kraken
• Blast nt/rna
RCX pipeline – nf-core/ampliseq
Quality control
• Data preprocessing
• Check reads quality
• Perform filt&trim
ASV calculation
• DADA2
• Error estimation
• Chimera removal
• Contamination removal
• Filtering
Assign taxonomy
• Database dependent
• Infer species
• Confidence intervals
• Multiple assignment
Taxonomic filtering
• Filter specific taxa
• Abundance filtering
Post processing
• Visualisation
• Diversity computation
• Functional analysis
Visualisations
Picrust2
• Phylogenetic Investigationof
Communities by Reconstructionof
UnobservedStates
• Functional analysis
• Gene families
• KEGG and COG database
• Based on phylogeny
• Genes present in microbial
genomes are similar amongst
relatives
• When sufficient genome sequences
are available, it is possible to
predict which gene families are
present in a given microbial OTU
from phylogeny alone.
Reporting
• Output files
• Additional analysis
• In-house post-processing