1 Metagenomics data analysis Mgr. Eva Budinská, Ph.D. Ing. Vojtěch Bartoň eva.budinska@recetox.muni.cz, vojtech.barton@recetox.muni.cz IV110 Projekt z bioinformatiky I IV114 Projekt z bioinformatiky a systémové biologie E4014 Projekt z Matematické biologie a biomedicíny - biomedicínská bioinformatika Metagenomics is the study of genetic material recovered directly from environmental samples. permafrost soil water stool indoordust skin Metagenomics hence studies microbiome Microbiome • is a community of microorganismsthat can usually be found living together in a specific environment • Microorganism- a single-celled organism that can only be seen under a microscope • Bacteria, Viruses, Fungi, Yeast, Algae, ... Dysbiosis when something goes wrong • Microbiome out of balance • Associated with many diseases, including cancer Specifics of metagenomic data analysis • Metagenomics studies sample DNA from the whole community • A metagenomic sample often contains reads from a huge number of (micro)organisms of which many are unknown • The sequences are often incompleteand hard to assemble to individual genes or recover full genomes of each organism Two main approaches Sequencing specific target genes (16S rRNA, 18S rRNA, ITS, rpoB…) Result: A quick estimate of taxonomic diversity and composition. Marker-gene metagenomics (targeted sequencing) Two main approaches Sequence all genomic fragments from the sample. Result: Insights into composition and function of the microbiome of the sample. Sequencing specific target genes (16S rRNA, 18S rRNA, ITS, rpoB…) Result: A quick estimate of taxonomic diversity and composition. Shotgun metagenomics (whole-genome sequencing) Marker-gene metagenomics (targeted sequencing) Marker-gene metagenomics (targeted sequencing) • Aim: obtain taxonomicalrepresentation of microbiome in the sample using specific target genes Marker-gene (targeted) metagenomics For bacteria: gene for 16s rRNA and its selected variableregions Marker-gene metagenomics (targeted sequencing) - basic workflow .fastq The usual data analysis pipeline 1. Preprocessing:Fastq files preprocessing, deconvolution, QC, trimming, joining reads 2. Identification of sequences representative of potential species 3. Taxonomy assignment 4. Exploratory analysis • Analysis of diversity measures and their visualization 5. Inference analysis • Associating composition with variables of interest .fastq Raw sequences 1. Deconvolution, QC, filtering, trimming, PE read joiningFiltered sequences .fastq .fastq .fastq OTUs / ASVs 3. Taxonomy assignment (species identification) Exploring and inference Alpha diversity 2. Binning The usual data analysis pipeline 1. Preprocessing:Fastq files preprocessing, deconvolution, QC, trimming, joining reads 2. Identification of sequences representative of potential species 3. Taxonomy assignment 4. Exploratory analysis • Analysis of diversity measures and their visualization 5. Inference analysis • Associating composition with variables of interest .fastq Raw sequences 1. Deconvolution, QC, filtering, trimming, PE read joiningFiltered sequences .fastq .fastq .fastq OTUs / ASVs 3. Taxonomy assignment (species identification) Exploring and inference Alpha diversity 2. Binning Marker-gene metagenomics Step 1. Preprocessing and QC • Marker gene metagenome is small => many samples are combined within a single run • Not uncommon to have all the reads in one fastq file • Samples need to be barcoded and demultiplexed • Followed by standard QC, trimming, joining PE reads The usual data analysis pipeline 1. Preprocessing:Fastq files preprocessing, deconvolution, QC, trimming, joining reads 2. Identification of sequences representative of potential species 3. Taxonomy assignment 4. Exploratory analysis • Analysis of diversity measures and their visualization 5. Inference analysis • Associating composition with variables of interest .fastq Raw sequences 1. Deconvolution, QC, filtering, trimming, PE read joiningFiltered sequences .fastq .fastq .fastq OTUs / ASVs 3. Taxonomy assignment (species identification) Exploring and inference Alpha diversity 2. Binning Marker-gene metagenomics Step 2. Identification of sequences representative of potential species • Aim: Organize reads according to their individual/organism of origin • Theory: one read ~ one gene ~ one individual ~ one species • Reality: – one species can have multiple and differentcopies of a gene – one sequence can be shared by multiple species – problem to distinguishsequencing error from a real change between species AACCGTCGACGGTCAT TTGCCATGACGATATA TTGCCATGACGATATA AACCGTCGACGGTCAT AACCGTCGACGGTCAT Marker-gene metagenomics Step 2. Identification of sequences representative of potential species • Solution: Clustering/binning of similar sequences into an OTU – operational taxonomic unit • Similarity: 97% or less or more... • "OTU picking" • Final representativesequence of an OTU is a consensus sequence (average, …) Marker-gene metagenomics Clustering de novo • Clustering based on similarity of sequencing without taking into account reference database Marker-gene metagenomics Clustering de novo •Clustering based on similarity of sequencing without taking into account reference database •Disadvantages: •if new samples added, we need to recluster (reanalyze)​ Marker-gene metagenomics Clustering de novo •Clustering based on similarity of sequencing without taking into account reference database •Disadvantages: •if new samples added, we need to recluster (reanalyze)​, this means, clustering can change Marker-gene metagenomics Clustering de novo •Clustering based on similarity of sequencing without taking into account reference database •Disadvantages: •if new samples added, we need to recluster (reanalyze)​, this means, clustering can change •computationally expensive •only way if reference is unknown Marker-gene metagenomics Clustering - closed reference •Using reference databases, we cluster around known sequences Marker-gene metagenomics Clustering - closed reference •Using reference databases, we cluster around known sequences •Disadvantages: •only for well characterized types of samples (stool?) •discarding unknown Marker-gene metagenomics Clustering – open reference •Using reference databases, we cluster around known sequences •Those that did not cluster with reference are clustered de novo Marker-gene metagenomics Clustering – open/close reference •PROBLEM: REFERENCE BIAS well studied microbes have hundreds to thousands sequences as opposed to few or none of the less studied ones From Sun Y1, Cai Y, HuseSM, Knight R, FarmerieWG, Wang X, Mai V. (2011) A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis.Brief Bioinform. 2012 Jan;13(1):107-21 Marker-gene metagenomics Bunch of problems... Substitutions ACTGCTAGC ACTGATAGC • PCR • sequencing How can we tell errors from real changes? Chimeras ACTGTAGC ACTGGGCT AGACGGCT • PCR Marker-gene metagenomics Correcting errors • Typical: QC, trimming, Nfiltering, length-filtering, adapter removal • OTU picking per se helps correcting some errors, but the problem persists Marker-gene metagenomics OTU picking Marker-gene metagenomics Let's be smarter - build an error model We can calculate posterior probability of a sequence being real vs artefact, based on our knowledge of sequence errors and their frequencies. We can test whether our observed frequency is larger than expected frequency and get a p-value. If we reject the hypothesis, we keep that sequence! Marker-gene metagenomics Applying error model •I discard improbable sequences Marker-gene metagenomics Applying error model •I discard improbable sequences Marker-gene metagenomics Applying error model •I discard improbable sequences •Each sequence now represents a taxonomical unit Marker-gene metagenomics Applying error model – getting ASVs •I discard improbable sequences •Each sequence now represents a taxonomical unit called Amplicon Sequence Variant – ASV • Only 1 sequence represents the bacteria Also known as ESV (exact sequence variant) or zOTU (zeroradius OUT)​ Marker-gene metagenomics Applying error model – getting ASVs •Already existing ASVs are not changed if new samples are added Marker-gene metagenomics Applying error model – getting ASVs •However! Adding new samples can result in previously discarded sequences becoming more frequent and become ASVs! •Already existing ASVs are not changed if new samples are added Marker-gene metagenomics Chimera removal The usual data analysis pipeline 1. Preprocessing:Fastq files preprocessing, deconvolution, QC, trimming, joining reads 2. Identification of sequences representative of potential species 3. Taxonomy assignment 4. Exploratory analysis • Analysis of diversity measures and their visualization 5. Inference analysis • Associating composition with variables of interest .fastq Raw sequences 1. Deconvolution, QC, filtering, trimming, PE read joiningFiltered sequences .fastq .fastq .fastq OTUs / ASVs 3. Taxonomy assignment (species identification) Exploring and inference Alpha diversity 2. Binning Marker-gene metagenomics Step 3. Taxonomy assignment Marker-gene metagenomics Step 3. Taxonomy assignment •A relatively easy task •Just alignd ASVs to the reference db! • BLAST, HMMER or USEARCH •But – how to deal with multiple results? Marker-gene metagenomics LCA – Least common ancestor Marker-gene metagenomics LCA – Least common ancestor Marker-gene metagenomics LCA – Least common ancestor Ok, all agree Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES Ok, all agree Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES Marker-gene metagenomics LCA – Least common ancestor Ok, most agree RECETOXBACTERIALES Marker-gene metagenomics LCA – Least common ancestor Ok, most agree RECETOXBACTERIALES RECETOXBACTERIA Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES RECETOXBACTERIA Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES RECETOXBACTERIA Marker-gene metagenomics LCA – Least common ancestor ??? RECETOXBACTERIALES RECETOXBACTERIA Marker-gene metagenomics LCA – Least common ancestor ??? RECETOXBACTERIALES RECETOXBACTERIA UNASSIGNED Marker-gene metagenomics LCA – Least common ancestor RECETOXBACTERIALES RECETOXBACTERIA UNASSIGNED Marker-gene metagenomics Database bias Marker-gene metagenomics Reference databases for 16SrRNA Beware of regularly updated versions of the dbs Marker-gene metagenomics Taxonomy classifiers for 16S rRNA gene sequences The usual data analysis pipeline 1. Preprocessing:Fastq files preprocessing, deconvolution, QC, trimming, joining reads 2. Identification of sequences representative of potential species 3. Taxonomy assignment 4. Exploratory analysis • Analysis of diversity measures and their visualization 5. Inference analysis • Associating composition with variables of interest .fastq Raw sequences 1. Deconvolution, QC, filtering, trimming, PE read joiningFiltered sequences .fastq .fastq .fastq OTUs / ASVs 3. Taxonomy assignment (species identification) Exploring and inference Alpha diversity 2. Binning Alpha diversity (within-sample diversity) • Used in ecology, a measure of how diverse a single sample is 1. Shannon index 2. Simpson index 3. Number of ASVs Alpha diversity (within-sample diversity) 3. Number of OTUs/ASVs 4. Chao 1 A= N + S2 / (2 D) • N is the number of OTUs/ASVs • S is the number of singleton OTUs/ASVs • D is the number of doublet OTUs, i.e. OTUs with abundance2. PCA – principal component analysis Based on the normalized c ompositional profiles Clustering Based on the normalized c ompositional profiles Group comparison – comparing differences between groups Applying statistical testing on each bacteria to determine difference in their abundance Statistical considerations Compositional nature of the metagenomic data • The microbiome abundancies (read counts)are constrained by the maximum number of DNA reads that the sequencer can provide (the total count constraint) • Hence the data representsin fact a proportion (composition) of taxa! 3000 500 14000 4000 200 24000 5000 100 300 SAMPLE 1 SAMPLE 2 SAMPLE 3 Firmicutes Bacteriodetes Alphaproteobacteria The data is compositional – so what? • The compositional nature of the data induces strong dependencies among the abundances of the different taxa: • an increase in the abundance of one taxon implies the decrease of the observed number of counts – hence proportions - for other taxaand vice versa 3000 500 14000 4000 200 24000 5000 100 300 SAMPLE 1 SAMPLE 2 SAMPLE 3 Firmicutes Bacteriodetes Alphaproteobacteria 3000 500 14000 4000 24000 100 300 SAMPLE 1 SAMPL E 2 SAMPLE 3 Firmicutes Bacteriodetes Alphaproteobacteria Before treatment After treatment The data is compositional – so what? In a composition the value of each component is not informative by itself and the relevant information is contained in the ratios between the components The most known: Firmicutes / Bacteroides ratio Firmicutes/Bacteroides ratio? Effect of DNA isolation kit on Gram+vs Gram- bacteria Videnskaet al. (2019)Stool sampling and DNA isolation kits affectDNA quality and bacterial compositionfollowing 16S rRNA gene sequencing using MiSeq Illumina platform,Scientific Reports Effect of sampling kit on G+ and G- bacteria Videnskaet al. (2019)Stool sampling and DNA isolation kits affectDNA quality and bacterial compositionfollowing 16S rRNA gene sequencing using MiSeq Illumina platform, Scientific Reports The data is compositional– so what? / part 2 • Compositional data do not exist in the Euclidean space, but in a special constraint space called the simplex • Hence it is incorrect to apply any multivariable techniques that are dependent on this space without proper data transformation (e.g. PCA, clustering….) BacteroidesFirmicutes Actinobacteria 100% 0% PCA on compositional data (without proper transformation) individuals Statistical methods for analysis of compositional data need to fulfill these criteria: 1. Scale Invariance (e.g. the result should be the same regardless of the scale of the measurement) • Example: how similar are these two samples? % Absolute read counts A B A B Fusobacteria 10 11 700 11000 Proteobacteria 15 14 1050 14000 Bacterioides 25 20 1750 20000 Firmicutes 50 55 3500 55000 Euclidean distance 7.2 57088.6 Statistical methods for analysis of compositional data need to fulfill thesecriteria: 2. Subcompositional coherence (e.g. the analyses should lead to the same conclusions regardless of which components of the data are included) This is especially a problem for correlations between taxa, which tend to be more negative when we remove some taxa and recalculate the proportions. Statistical methods for analysis of compositional data need to fulfill these criteria: 2. Subcompositional coherence (e.g. the analyses should lead to the same conclusions regardless of which components ofthe data are included) Alternative(s) to correlation: 1. phi (Φ) = var(Ai -Aj)/var(Ai) 2. rho (⍴) = var(Ai -Aj)/(var(Ai) + var(Aj)) 3. phis (Φs) = var(Ai -Aj)/var(Ai +Aj) Ai is the log-transformed values for a metagenomic component ‘i’ in the data Aitchison, 1982, J.R.Statist. Soc. Lovell et al., 2015, PLoS Comp Biol Quinn et al, 2017, Scientific Reports 7 Data transformation (normalization) • Compositional data can be normalized in order to make them suitable for existing statistical techniques • Aitchinson, 1982 - build a theory and concepts of analysis of compositional data and suggested normalizations • Basic concept – make log-ratios between components ALR (additive log-ratio transformation) + good for most statistical techniques – needs careful selection of one component, we are working with k-1 taxa, more difficult to interpret CLR (centered log-ratio transformation) + ratio to geometric mean, preserves all taxa, no need to select one – creates singular covariancematrix ILR (isometric log-ratio transformation)[Egozcque, 2003] PhILR (phylogenetic partitioning based ILR transform)[Silverman et al, 2017] Compositional data - PCA before and after normalization PCA on absolute counts– main variance lies in the sequencing depth Relativedata – problem with simplex, colour by individual CLR transformed data– colour by individual The excess zero problem • Log-ratio transformations require data with positive values, any statistical analysis of count compositions must be preceded by a proper replacement of the zeros • What to do? • We need to fill in the zeroes.... • E.g. Bayesian multiplicative treatment of count zeros [Martín-Fernandez,2014,StatisticalModelling] We do not know whether the zeros are real or just below th reshold The 16SrRNAgene copy number problem • Different taxa have different number of copies of the 16S rRNA gene - this • Some algorithms exist that try to normalize this count, but for many taxa this is unknown, hence estimation takes place. • Should we normalize for count or not? Take home messages • Metagenomic data are compositional and it is not optional! • Compositional data do not exist in Euclidean space • Transformations of compositional data and specific statistical approaches are needed for data analysis • It is not easy to make correlations between the taxa • Normalization to 16SrRNA gene copy number change is necessary, but still quite tricky and the benefits are not clear Resources to study • OUT vs ASV explained VIDEO • Dan Knight: microbiome discovery – series of videos (however, uses QIIME and OTUs, which is outdated now, still very worth it!) • PathoScope 2.0: a complete computationalframework for strain identification in environmental or clinical sequencing samples | Microbiome | Full Text (biomedcentral.com) Metagenomic pipelines Pipelines • DADA2 • QIIME2 • Mothur • PathoScope 2.0 • Kraken Databases • SILVA 138 • GreenGenes 13_8 • RefSeq2020 • Kraken • Blast nt/rna RCX pipeline – nf-core/ampliseq Quality control • Data preprocessing • Check reads quality • Perform filt&trim ASV calculation • DADA2 • Error estimation • Chimera removal • Contamination removal • Filtering Assign taxonomy • Database dependent • Infer species • Confidence intervals • Multiple assignment Taxonomic filtering • Filter specific taxa • Abundance filtering Post processing • Visualisation • Diversity computation • Functional analysis Visualisations Picrust2 • Phylogenetic Investigationof Communities by Reconstructionof UnobservedStates • Functional analysis • Gene families • KEGG and COG database • Based on phylogeny • Genes present in microbial genomes are similar amongst relatives • When sufficient genome sequences are available, it is possible to predict which gene families are present in a given microbial OTU from phylogeny alone. Reporting • Output files • Additional analysis • In-house post-processing