Kallisto
“Near-optimal probabilistic RNA-seq quantification”
Webpage: https://pachterlab.github.io/kallisto/

What is kallisto?
• Program for quantifying abundances of transcripts (or, more generally, of target sequences) using high-throughput sequencing reads
• Works with bulk and single-cell RNA-seq data
• Based on pseudoalignment (alignment-free)
• “we develop a method based on pseudoalignment of reads and fragments, which focuses only on identifying the transcripts from which the reads could have originated and does not try to pinpoint exactly how the sequences of the reads and transcripts align.”

What is kallisto?
• High speed: 30 million human reads quantified in less than 3 minutes on a Mac desktop; building the index takes about 10 minutes
• Pseudoalignment of reads preserves the key information needed for quantification, so kallisto is not only fast but also as accurate as existing quantification tools
• The pseudoalignment procedure is robust to errors in the reads; in many benchmarks kallisto significantly outperforms existing tools

What is kallisto?
• Released: 2015/2016
• Latest release: Jan 17 2022
• Distribution: Windows, Mac/Linux, Rock64
Webpage: https://pachterlab.github.io/kallisto/
GitHub: https://github.com/pachterlab/kallisto/
Bioconda: https://anaconda.org/bioconda/kallisto/

How does it work?
• a) Construction of a de Bruijn graph from the k-mers present in the transcriptome (T-DBG)
• b) Nodes are k-mers; each transcript corresponds to a path, and this path covering of the graph defines a compatibility class (a set of transcripts) for every k-mer
• c) Assigning compatibility classes to a read: an error-free read corresponds to a path in the graph, found by matching the read's k-mers to nodes

How does it work?
• d) Redundant k-mers are skipped during pseudoalignment, which increases speed
• e) An equivalence class for a read is a multi-set of transcripts associated with the read
  • Ideally it represents the transcripts a read could have originated from
  • Transcript abundances are estimated from the equivalence classes with the expectation-maximization (EM) algorithm, which finds the maximum-likelihood estimate

How do you use it?
• 1. Indexing (kallisto index)
• 2. Quantification (kallisto quant)

How do you use it?
• Outputs:
  • Abundance table (*.h5 / *.tsv)
  • Run information (*.json)

Why should you use it?
• Test simulation
• 20 RNA-seq simulations/experiments
• Curated reference sample
• 75 bp paired-end RNA-seq reads
• 30 million reads
• qPCR control for transcript abundance
• Efficiency testing

Why should you (or should you not) use it?
• Accuracy
  • Uses the T-DBG: multimapping reads are handled via the path covering (compatibility / equivalence classes) and the maximum-likelihood algorithm (also for overlaps)
  • Relies on a high-quality transcriptome for indexing
  • Does not discard poorly matching reads: if there is no better match, a read is still pseudoaligned and assigned by the maximum-likelihood (ML) algorithm, even if only a single k-mer matches
• Speed
  • Ignores k-mers that contain sequencing errors (they cannot be found in the index)
  • Skips redundant k-mers during computation
• Resources
  • Multithreading (all datasets in parallel)
  • Relatively low RAM and CPU usage (test run on a small laptop: 10 minutes)

Thank you for your attention!
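
Steps a) to e) under "How does it work?" can be illustrated with a small Python sketch. It is a toy under stated assumptions, not kallisto's implementation: a plain k-mer lookup table stands in for the T-DBG, no redundant k-mers are skipped, full transcript lengths replace effective lengths, and the transcripts, reads and function names (build_index, pseudoalign, em) are made up for illustration.

```python
from collections import Counter, defaultdict

K = 5  # toy k-mer length, chosen only to fit the short example sequences


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k=K):
    """Map each k-mer to the set of transcripts containing it (its compatibility class)."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def pseudoalign(read, index, k=K):
    """Intersect the compatibility classes of all read k-mers found in the index."""
    eq_class = None
    for km in kmers(read, k):
        if km not in index:  # e.g. a k-mer hit by a sequencing error: ignored
            continue
        eq_class = set(index[km]) if eq_class is None else eq_class & index[km]
    return frozenset(eq_class or ())


def em(eq_counts, lengths, n_iter=200):
    """EM estimate of the fraction of reads originating from each transcript."""
    names = sorted(lengths)
    alpha = {t: 1.0 / len(names) for t in names}
    total = sum(eq_counts.values())
    for _ in range(n_iter):
        assigned = dict.fromkeys(names, 0.0)
        for ec, count in eq_counts.items():
            # E-step: split the class count in proportion to alpha / length
            weights = {t: alpha[t] / lengths[t] for t in ec}
            norm = sum(weights.values())
            for t, w in weights.items():
                assigned[t] += count * w / norm
        # M-step: renormalise the expected read counts
        alpha = {t: assigned[t] / total for t in names}
    return alpha


# Hypothetical transcripts: identical first 12 bases, different 3' ends
transcripts = {
    "t1": "ACGTTGCAAGGCTTATCCGA",
    "t2": "ACGTTGCAAGGCCAGTGACT",
}
reads = (
    ["GTTGCAAG"] * 3     # shared region, compatible with t1 and t2
    + ["GTTGCAAGGA"]     # shared region, last base is a sequencing error
    + ["GGCTTATCC"] * 3  # unique to t1
    + ["GCCAGTGAC"]      # unique to t2
    + ["TTTTTTTT"]       # no k-mer in the index, not pseudoaligned
)

index = build_index(transcripts)
classes = [pseudoalign(r, index) for r in reads]
eq_counts = Counter(ec for ec in classes if ec)  # drop unaligned reads
lengths = {name: len(seq) for name, seq in transcripts.items()}
alpha = em(eq_counts, lengths)
print({t: round(a, 3) for t, a in alpha.items()})  # roughly t1: 0.75, t2: 0.25
```

The four ambiguous reads are split between t1 and t2 in proportion to the evidence from the uniquely assigned reads (3 versus 1), which is how the maximum-likelihood step resolves multimapping reads in this simplified setting.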