Variant calling in targeted
sequencing
Mgr. Eva Budinská, Ph.D.
Genotype and variant calling
Variant calling – identifies variable sites in genome
Genotype calling – determines the genotype for each individual at
each site (0/0, 1/1 – homozygote, 0/1 - heterozygote)
Pirooznia et al. (2014) validation and assessment of variant calling pipelines for next-generation sequencing. Human Genomics 8:14
Aim of variant calling
- To define genomic positions with single nucleotide
polymorphisms, deletions, insertions or long repetitive sequence
insertions specific for the sample
- and if of interest, consequently call a genotype (identify whether
the sample is heterozygote or homozygote) for the allele at the
particular position
Results of variant calling
Results of variant calling algorithms are presented in
standardized VCF file (variant calling format).
… definition from lecture 3
...format example from lecture 3
Interpreting the file header
Version of the VCF specification to
which the file conforms (important for
handling and interpreting files!)
The filter line – what type of filters have
been applied to the data.
Which reference file was used
Interpreting the file header - INFO
INFO lines define shortcuts for various site-level annotations.
These are then later used in the INFO field in the variant
specification. Beware, the definitions may differ between tools
generating vcf files...
Interpreting the file header - FORMAT
FORMAT lines define how the genotype and other sample-level
information is represented. Beware, the definitions may differ
between tools generating vcf files...
Interpreting the records:
Reference sequence
at the position
Alternative
sequence(s) at the
position
Quality: The Phred-scaled probability that
REF/ALT polymorphism exists at this site given
sequencing data.
Because the Phred scale is -10 * log(1-p), a
value of 10 indicates a 1 in 10 chance of error,
while a 100 indicates a 1 in 10^10 chance of
error.
This number can grow very large when a large
amount of data is used for variant calling => not
often a very useful property for evaluating the
quality of a variant call.
Identifiers of the
position and gene at
which the variant was
found
Interpreting the records: FILTER
Result of quality filters applied for filtering out low quality variant calls. The values are defined
in the header:
Let's get more in detail: INFO
Various site-level annotations and their values, as
defined in the INFO lines in the header.
Let's get more in detail: FORMAT
● How the genotype and other sample-level information is
represented in the sample columns!
Sample NA0001, first line variant:
● Genotype: 0|0 (homozygote for
reference allele)
● Genotype quality: 48
● Read depth: 1
● Haplotype quality: 51 and 51
More info about vcf files
● http://gatkforums.broadinstitute.org/discussion/1268/what-is-a-vc
f-and-how-should-i-interpret-it
● http://www.1000genomes.org/node/101
● Help files of the tools generating vcf files!
VCF tools
● Different methods of calling vcf, often designed on specific NGS
tools/data types – not necessarily compatible!
Variant identification
Tools can be (very roughly) divided into:
● Germline callers – central part of finding causes of rare
inherited diseases
● Somatic callers – mainly cancer studies
● CNV identification – large structural modifications
● SV (structural variants) identification – inversions, translocations
or large indels (insertions-deletions)
Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform
Genotype calling methods
● Cutoff-based methods (early methods)
– simple: based on SNP and genotype calling on separate
analyses of data from each individual
● Probabilistic methods
– based usually on multiple samples and assigning
probabilities for a given genotype
Cutoff-based genotype calling methods
● Simple: based on SNP and genotype calling on separate analyses of data
from each individual
● First step: filtering step in which only high-confidence bases are kept
(Qphred>20)
● Second step:genotype calling – counting number of times each allele is
observed, using fixed cutoffs (e.g. if the proportion of the alternative allele is
between 20-80%, heterozygote is called)
● Used mainly for high sequencing depths (>20X), so that the probability of a
heterozygous individual falling outside the selected range(20-80%) is small
● Empirically determined cutoffs can be used
● Usually used in targeted sequencing
● Does not provide measures of uncertainty in the genotype inference
Probabilistic genotype calling methods
● For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on
fixed cutoffs will typically lead to under-calling of heterozygous phenotypes
● Several probabilistic methods were developed that use the quality score to provide a
posterior probability for each genotype:
● Advantages:
– higher accuracy
– natural framework for incorporating information regarding allelle frequencies and
patterns of LD (linkage disequilibrium)
P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read
data for a particular individual and site)
P(G) – genotype prior
From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of
genotype G
The genotype with the highest posterior probability is generally chsen and this probability is used
as a measure of confidence.
Probabilistic genotype calling methods
● For moderate to low sequencing depths (mainly WGS, WES) – genotype calling based on
fixed cutoffs will typically lead to under-calling of heterozygous phenotypes
● Several probabilistic methods were developed that use the quality score to provide a
posterior probability for each genotype:
● Advantages:
– higher accuracy
– natural framework for incorporating information regarding allelle frequencies and
patterns of LD (linkage disequilibrium)
P (X | G) – genotype likelihood for genotype G can be computed (X represents all of the read
data for a particular individual and site)
P(G) – genotype prior
From these two, Bayes' formula is used to calculate P (G | X) – the posterior probability of
genotype G
The genotype with the highest posterior probability is generally chsen and this probability is used
as a measure of confidence.
Calculating genotype likelihood P(X|G)
● Can be computed from quality scores for each read
Xi – data in read I for a particular individual and particular site with genotype
G
● P(Xi |G) is given by rescaling of the quality core of Xi and the genotype
likelihood P(X|G) can be calculated directly from the data by taking the
product of P(Xi |G) over all i
● See for instance:
– https://www.broadinstitute.org/gatk/media/docs/Samtools.pdf
Calculating genotype priors
● Can be performed using single or multiple samples
● Single-sample: Prior-genotype probability may be chosen to assign equal
probability to all genotypes, or it can be based on external information (e.g.
the reference sequence, SNP databases or an available population sample)
– In SOAPsnp tool, a prior is chosen by the use of dbSNP
– Example: if a G/T polymorphism is reported in dbSNP, the prior
probabilities are set to be 0.454 for each of the genotypes GG and TT;
0.0909 for GT and less than 10e-4 for all other genotypes
● Multiple-samples: Priors derived by joint analysis of multiple individuals –
by analysis of allele or genotype frequencies estimated from larger data sets
e.g. using maximum likelihood
Tools for genotype calling
Nielsen et al. (2011): Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12, 443-451
pindel
● Common variant calling tools are not designed to find long repetitive inserts!
● Pindel is a very good alternative – mainly for targeted sequencing data.
http://gmt.genome.wustl.edu/packages/pindel/
● Generally, it is recommended to combine multiple tools, based on the
biological questions and data type!
Genotype calling?
● Attention – sometimes, it is not of interest to call a genotype –
e.g. if heterogenous tissues are sampled and we can expect
different clones!
– Example: cancer cells – one clone from thousands can
regrow into metastasis!
Variant annotation
● We have the possibility of predicting the functional impact of
variants in an automated fashion
● This is very important for biological interpretation of the results!
● Not all the tools are able to annotate all types of variants
● Output: usually vcf file, with information on annotation in the
INPUT field
Example of annotated .vcf file
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
chr1 868329 . A C 25.56 LowQual AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M
Q=99.00;QD=12.78;SOR=0.693;ANN=C|intron_variant|MODIFIER|
SAMD11|SAMD11|transcript|NM_152486.2|Coding|4/13|c.305+186
0A>C||||||
GT:AD:DP:FT:GQ:PL
chr1 1665702 . T C 62.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M
Q=60.00;QD=31.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|
SLC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+42
7A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|transcrip
t|NM_001199787.1|Coding|6/6|c.732+427A>G||||||
GT:AD:DP:FT:GQ:PL
chr1 1665740 . T C 66.55 PASS AC=2;AF=1.00;AN=2;DP=2;FS=0.000;MLEAC=2;MLEAF=1.00;M
Q=60.00;QD=33.27;SOR=2.303;ANN=C|intron_variant|MODIFIER|
SLC35E2|SLC35E2|transcript|NM_182838.2|Coding|5/5|c.732+38
9A>G||||||,C|intron_variant|MODIFIER|SLC35E2|SLC35E2|transcrip
t|NM_001199787.1|Coding|6/6|c.732+389A>G||||||
GT:AD:DP:FT:GQ:PL
Pabinger et al. (2013) Survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform
Variant annotation tools