Kallisto
“Near-optimal probabilistic RNA-seq quantification”
Webpage: https://pachterlab.github.io/kallisto/
What is kallisto?
• Program for quantifying abundances of transcripts from RNA-seq data, or more generally
of target sequences, using high-throughput sequencing reads
• Bulk/Single cell RNA-seq data
• Based on pseudoalignment (alignment-free)
• “we develop a method based on pseudoalignment of reads and fragments, which focuses only on
identifying the transcripts from which the reads could have originated and does not try to pinpoint
exactly how the sequences of the reads and transcripts align.”
What is kallisto?
• High speed: 30 million human reads quantified in less than 3 minutes on a Mac desktop (building the index takes ca. 10 minutes)
• Pseudoalignment of reads preserves key information needed for quantification and kallisto
is therefore not only fast, but also as accurate as existing quantification tools
• Pseudoalignment procedure is robust to errors in the reads – in many benchmarks kallisto
significantly outperforms existing tools
What is kallisto?
• Released: 2015/2016
• Latest release: Jan 17 2022
• Distribution: Windows, Mac/Linux, Rock64
Webpage: https://pachterlab.github.io/kallisto/
GitHub: https://github.com/pachterlab/kallisto/
Bioconda: https://anaconda.org/bioconda/kallisto/
How does it work?
• a) Construction of a de Bruijn graph from the k-mers
present in the transcriptome (T-DBG); nodes = k-mers
• b) Each transcript corresponds to a path in the graph; this
path covering assigns a compatibility class to every k-mer
• c) An error-free read is represented as a path in the
graph; its compatibility class is found by looking up its
k-mers and combining their compatibility classes
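Steps a) to c) can be illustrated with a minimal sketch. It assumes a plain hash map from k-mers to transcript sets instead of a real de Bruijn graph, ignores reverse complements, and uses made-up toy sequences; the function names `build_kmer_index` and `pseudoalign` are hypothetical and not part of kallisto.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it.

    This set plays the role of the k-mer's compatibility class.
    """
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the compatibility classes of all read k-mers found in the index."""
    result = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer not in index:
            # k-mer absent from the transcriptome (e.g. sequencing error): skip it
            continue
        result = set(index[kmer]) if result is None else result & index[kmer]
    return result or set()


# Toy transcriptome (made-up sequences, for illustration only)
transcripts = {
    "t1": "ACGTACGGTCA",
    "t2": "ACGTACGTTGA",
    "t3": "GGTCATTTCCA",
}
index = build_kmer_index(transcripts, k=5)

shared = pseudoalign("ACGTACG", index, k=5)   # compatible with t1 and t2
unique = pseudoalign("CGGTCA", index, k=5)    # compatible with t1 only
print(sorted(shared), sorted(unique))
```

The sketch never computes where a read aligns inside a transcript; it only reports which transcripts the read is compatible with, which is the idea behind pseudoalignment.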
How does it work?
• d) Skipping redundant k-mers during
pseudoalignment = speed increase
• e) The equivalence class of a read is a multi-set of
transcripts associated with the read
• ideally it represents the transcripts a read could
have originated from
• transcript abundances are estimated from the
equivalence classes with the Expectation Maximization (EM)
algorithm, which finds the maximum likelihood estimate
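The EM step can be sketched as follows. This is a simplified illustration, assuming the likelihood is a product over equivalence classes of (sum of alpha_t / effective length of t) raised to the class count; the function name `em_abundances`, the fixed iteration count, and the toy counts are assumptions, not kallisto's implementation.

```python
def em_abundances(class_counts, eff_lengths, n_iter=200):
    """Estimate alpha_t, the fraction of reads originating from transcript t.

    class_counts: dict mapping a frozenset of transcript names to a read count
    eff_lengths:  dict mapping a transcript name to its effective length
    """
    transcripts = sorted(eff_lengths)
    total = sum(class_counts.values())
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start

    for _ in range(n_iter):
        new_alpha = {t: 0.0 for t in transcripts}
        for eq_class, count in class_counts.items():
            # E-step: split the reads of this class among its transcripts
            weights = {t: alpha[t] / eff_lengths[t] for t in eq_class}
            denom = sum(weights.values())
            if denom == 0.0:
                continue
            for t, w in weights.items():
                new_alpha[t] += count * w / denom
        # M-step: renormalise the expected read counts to fractions
        alpha = {t: new_alpha[t] / total for t in transcripts}
    return alpha


# Toy example: 30 reads unique to t1, 10 unique to t2, 60 compatible with both
counts = {
    frozenset({"t1"}): 30,
    frozenset({"t2"}): 10,
    frozenset({"t1", "t2"}): 60,
}
alpha = em_abundances(counts, {"t1": 1000.0, "t2": 1000.0})
print(alpha)
```

With equal effective lengths, the 60 ambiguous reads end up split in proportion to the unique evidence, so the estimates approach 0.75 for t1 and 0.25 for t2.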
How do you use it?
• 1. Indexing
• 2. Quantification
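The two steps map onto the `kallisto index` and `kallisto quant` subcommands. The sketch below only builds and prints the command lines; all file names are placeholders, the helper functions are hypothetical, and paired-end input with 4 threads is an assumption.

```python
import shlex


def kallisto_index_cmd(transcriptome_fasta, index_path):
    """Step 1: build an index from a transcriptome FASTA file."""
    return ["kallisto", "index", "-i", index_path, transcriptome_fasta]


def kallisto_quant_cmd(index_path, out_dir, reads_1, reads_2, threads=4):
    """Step 2: quantify paired-end reads against the index."""
    return [
        "kallisto", "quant",
        "-i", index_path,     # index built in step 1
        "-o", out_dir,        # output directory
        "-t", str(threads),   # number of threads
        reads_1, reads_2,
    ]


# Placeholder file names, for illustration only
index_cmd = kallisto_index_cmd("transcripts.fa", "transcripts.idx")
quant_cmd = kallisto_quant_cmd(
    "transcripts.idx", "output", "reads_1.fastq.gz", "reads_2.fastq.gz"
)

for cmd in (index_cmd, quant_cmd):
    print(shlex.join(cmd))
    # To actually run it (requires kallisto on the PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

The index only needs to be built once per transcriptome and can be reused for every sample.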
How do you use it?
• Outputs:
• table in *.h5 / *.tsv
• run information (*.json)
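The tab-separated table can be read with a few lines of code. The sketch assumes the usual `abundance.tsv` layout with the columns `target_id`, `length`, `eff_length`, `est_counts` and `tpm`; the two rows are made-up values, and `read_abundance` is a hypothetical helper.

```python
import csv
import io


def read_abundance(handle):
    """Return a dict mapping target_id to TPM from a kallisto-style TSV table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["target_id"]: float(row["tpm"]) for row in reader}


# Made-up example table; with real data use open("output/abundance.tsv")
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx_A\t1500\t1321.0\t120\t700000\n"
    "tx_B\t900\t721.0\t28\t300000\n"
)

tpm = read_abundance(example)
top = max(tpm, key=tpm.get)  # transcript with the highest TPM
print(top, tpm[top])
```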
Why should you use it?
• Test simulation
• 20 RNA-seq simulations/experiments
• Curated reference sample
• 75 bp paired-end RNA-seq reads
• 30 million reads
• qPCR control for transcript abundance
• efficiency testing
Why should you (or should you not) use it?
• Accuracy
• Uses the T-DBG: multimapping reads are handled via the path covering (compatibility /
equivalence classes) and the maximum likelihood algorithm (also for overlaps)
• Relies on high-quality transcriptome for indexing
• Does not discard reads with few matching k-mers: if there is no better match, such a
read is still pseudoaligned and passed to the ML algorithm, even if only a single k-mer
matches
• Speed
• Skips k-mers that contain sequencing errors (they cannot be found in the index)
• Removes redundant k-mers from computation
• Resources
• Multithreading (all datasets in parallel)
• Relatively low RAM and CPU usage (small laptop test runtime: 10 minutes)
Thank you for your attention!