Variant calling in targeted sequencing Mgr. Eva Budinská, Ph.D. Genotype and variant calling Variant calling – identifies variable sites in genome Genotype calling – determines the genotype for each individual at each site (0/0, 1/1 – homozygote, 0/1 - heterozygote) Pirooznia et al. (2014) validation and assessment of variant calling pipelines for next-generation sequencing. Human Genomics 8:14 Aim of variant calling - To define genomic positions with single nucleotide polymorphisms, deletions, insertions or long repetitive sequence insertions specific for the sample - and if of interest, consequently call a genotype (identify whether the sample is heterozygote or homozygote) for the allele at the particular position Results of variant calling Results of variant calling algorithms are presented in standardized VCF file (variant calling format). … definition from lecture 3 ...format example from lecture 3 Interpreting the file header Version of the VCF specification to which the file conforms (important for handling and interpreting files!) The filter line – what type of filters have been applied to the data. Which reference file was used Interpreting the file header - INFO INFO lines define shortcuts for various site-level annotations. These are then later used in the INFO field in the variant specification. Beware, the definitions may differ between tools generating vcf files... Interpreting the file header - FORMAT FORMAT lines define how the genotype and other sample-level information is represented. Beware, the definitions may differ between tools generating vcf files... Interpreting the records: Reference sequence at the position Alternative sequence(s) at the position Quality: The Phred-scaled probability that REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance of error. This number can grow very large when a large amount of data is used for variant calling => not often a very useful property for evaluating the quality of a variant call. Identifiers of the position and gene at which the variant was found Interpreting the records: FILTER Result of quality filters applied for filtering out low quality variant calls. The values are defined in the header: Let's get more in detail: INFO Various site-level annotations and their values, as defined in the INFO lines in the header. Let's get more in detail: FORMAT ● How the genotype and other sample-level information is represented in the sample columns! Sample NA0001, first line variant: ● Genotype: 0|0 (homozygote for reference allele) ● Genotype quality: 48 ● Read depth: 1 ● Haplotype quality: 51 and 51 More info about vcf files ● http://gatkforums.broadinstitute.org/discussion/1268/what-is-a-vc f-and-how-should-i-interpret-it ● http://www.1000genomes.org/node/101 ● Help files of the tools generating vcf files! VCF tools ● Different methods of calling vcf, often designed on specific NGS tools/data types – not necessarily compatible! Variant identification Tools can be (very roughly) divided into: ● Germline callers – central part of finding causes of rare inherited diseases ● Somatic callers – mainly cancer studies ● CNV identification – large structural modifications ● SV (structural variants) identification – inversions, translocations or large indels (insertions-deletions) Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform Genotype calling methods ● Cutoff-based methods (early methods) – simple: based on SNP and genotype calling on separate analyses of data from each individual ● Probabilistic methods – based usually on multiple samples and assigning probabilities for a given genotype Cutoff-based genotype calling methods ● Simple: based on SNP and genotype calling on separate analyses of data from each individual ● First step: filtering step in which only high-confidence bases are kept (Qphred>20) ● Second step:genotype calling – counting number of times each allele is observed, using fixed cutoffs (e.g. if the proportion of the alternative allele is between 20-80%, heterozygote is called) ● Used mainly for high sequencing depths (>20X), so that the probability of a heterozygous individual falling outside the selected range(20-80%) is small ● Empirically determined cutoffs can be used ● Usually used in targeted sequencing ● Does not provide measures of uncertainty in the genotype inference Probabilistic genotype calling methods ● For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on fixed cutoffs will typically lead to under-calling of heterozygous phenotypes ● Several probabilistic methods were developed that use the quality score to provide a posterior probability for each genotype: ● Advantages: – higher accuracy – natural framework for incorporating information regarding allelle frequencies and patterns of LD (linkage disequilibrium) P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read data for a particular individual and site) P(G) – genotype prior From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of genotype G The genotype with the highest posterior probability is generally chsen and this probability is used as a measure of confidence. Probabilistic genotype calling methods ● For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on fixed cutoffs will typically lead to under-calling of heterozygous phenotypes ● Several probabilistic methods were developed that use the quality score to provide a posterior probability for each genotype: ● Advantages: – higher accuracy – natural framework for incorporating information regarding allelle frequencies and patterns of LD (linkage disequilibrium) P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read data for a particular individual and site) P(G) – genotype prior From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of genotype G The genotype with the highest posterior probability is generally chsen and this probability is used as a measure of confidence. Calculating genotype likelihood P(X|G) ● Can be computed from quality scores for each read Xi – data in read I for a particular individual and particular site with genotype G ● P(Xi |G) is given by rescaling of the quality core of Xi and the genotype likelihood P(X|G) can be calculated directly from the data by taking the product of P(Xi |G) over all i ● See for instance: – https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf Calculating genotype priors ● Can be performed using single or multiple samples ● Single-sample: Prior-genotype probability may be chosen to assign equal probability to all genotypes, or it can be based on external information (e.g. the reference sequence, SNP databases or an available population sample) – In SOAPsnp tool, a prior is chosen by the use of dbSNP – Example: if a G/T polymorphism is reported in dbSNP, the prior probabilities are set to be 0.454 for each of the genotypes GG and TT; 0.0909 for GT and less than 10e-4 for all other genotypes ● Multiple-samples: Priors derived by joint analysis of multiple individuals – by analysis of allele or genotype frequencies estimated from larger data sets e.g. using maximum likelihood Tools for genotype calling Nielsen et al. (2011): Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12, 443-451 pindel ● Common variant calling tools are not designed to find long repetitive inserts! ● Pindel is a very good alternative – mainly for targeted sequencing data. http://gmt.genome.wustl.edu/packages/pindel/ ● Generally, it is recommended to combine multiple tools, based on the biological questions and data type! Genotype calling? ● Attention – sometimes, it is not of interest to call a genotype – e.g. if heterogenous tissues are sampled and we can expect different clones! – Example: cancer cells – one clone from thousands can regrow into metastasis! Variant annotation ● We have the possibility of predicting the functional impact of variants in an automated fashion ● This is very important for biological interpretation of the results! ● Not all the tools are able to annotate all types of variants ● Output: usually vcf file, with information on annotation in the INPUT field Example of annotated .vcf file #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr1 868329 . A C 25.56 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M Q=99.00;QD=12.78;SOR=0.693;ANN=C|intron_variant|MODIFIER| SAMD11|SAMD11|transcript|NM_152486.2|Coding|4/13|c.305+186 0A>C|||||| GT:AD:DP:FT:GQ:PL chr1 1665702 . T C 62.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M Q=60.00;QD=31.27;SOR=2.303;ANN=C|intron_variant|MODIFIER| SLC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+42 7A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|transcrip t|NM_001199787.1|Coding|6/6|c.732+427A>G|||||| GT:AD:DP:FT:GQ:PL chr1 1665740 . T C 66.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M Q=60.00;QD=33.27;SOR=2.303;ANN=C|intron_variant|MODIFIER| SLC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+38 9A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|transcrip t|NM_001199787.1|Coding|6/6|c.732+389A>G|||||| GT:AD:DP:FT:GQ:PL Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform Variant annotation tools