Detecting RNA modifications from nanopore signal + exercises
Vlastimil Martinek

Central dogma of molecular biology - refresher

Modifications
- Alter exposed DNA/RNA
- Change gene expression
- Can be a biological target
- Immune system
- Repair
- Multiple types of modifications
- Example: ACCGTC <=> ACʼCʼGTCʼ (primes mark modified bases)

ONT (Oxford Nanopore Technologies)
- Sequencing to obtain digital data
- Basecalling

Existing methods
- The raw signal carries modification information
- Need a control sequence
- Dependent on the basecaller

New approach
- Detect modifications without a reference
- Binary output: modified / not modified

Data
- 1D signal per sequence
- Variable length
- Two labels: modified / non-modified
- 1M positives, 2M negatives
- Preprocessing?

Data exploration and preparation

Preprocessing (see the sketch at the end)
- Cut primers
- Windowing
- Balanced sampling
- Read-wise standardization
- …?

Noisy data labels
- High false-positive rate among positive labels (50-70%)
- All negative labels are correct
- This caps the metrics: with balanced classes and a fraction f = 0.5-0.7 of wrong positive labels, even a perfect model agrees with only (2 - f)/2 of the labels, i.e. 65-75% apparent accuracy
- How to tackle this?

Label cleaning
Confident learning (Google, MIT, cleanlab)
- Input: noisy labels + predicted probabilities
- Predicts label mistakes on held-out data (k-fold cross-validation is needed to clean the whole dataset)

Estimating label issues from the joint distribution matrix Q of noisy labels ỹ and true labels y* (matrix figure omitted):
1. Multiply the joint distribution matrix by the number of examples. Assume 100 examples in our dataset; then, per the Q matrix, 10 images labeled dog are actually images of foxes.
2. Mark the 10 images labeled dog with the largest predicted probability of belonging to class fox as label issues.
3. Repeat for all off-diagonal entries of the matrix.

Confident learning
"The central idea is that when the predicted probability of an example is greater than a per-class-threshold, we confidently count that example as actually belonging to that threshold's class. The thresholds for each class are the average predicted probability of examples in that class." (See the numpy sketch at the end.)

Preliminary results: it works (positive labels are the ones flagged as issues)

Confident learning - CIFAR-10 accuracy benchmark (figure omitted)

Model brainstorming

Initial architecture
- 1D ResNet
- ~60% accuracy

Transfer learning with basecallers (DNA)
- Custom head

Training (see the PyTorch sketch at the end)
- Transfer learning with basecallers (DNA)
- Different learning rates per module
- LR warmup
- ~72% accuracy
- Input length matters (positives are shorter)

Future
- Transformer + limited attention
- RNA basecaller
- Interpretability + basecalling info
- Learning with cleaned labels
- Speedup: parallelization
- Experiment differences
- Use statistical modification-detection methods
- Base-wise labeling

Sources
- https://inno-forum.org/single-cell-rna-sequencing-technique-come-age/
- https://www.labclinics.com/2018/11/08/role-dna-methylation-disease/?lang=en
- https://www.yourgenome.org/facts/what-is-oxford-nanopore-technology-ont-sequencing/
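Preprocessing sketch. A minimal sketch of the preprocessing slide, assuming each read arrives as a variable-length 1D numpy array of raw current samples; the primer lengths, window size, and stride are illustrative placeholders, not values from the talk.

```python
import numpy as np

def cut_primers(signal: np.ndarray, head: int = 100, tail: int = 100) -> np.ndarray:
    """Trim assumed primer/adapter regions from both ends of the read."""
    return signal[head:len(signal) - tail]

def standardize_readwise(signal: np.ndarray) -> np.ndarray:
    """Z-score each read against its own mean/std (read-wise, not dataset-wide)."""
    return (signal - signal.mean()) / (signal.std() + 1e-8)

def windows(signal: np.ndarray, size: int = 1000, stride: int = 500) -> list:
    """Slice a variable-length read into fixed-size, possibly overlapping windows."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, stride)]

def preprocess(read: np.ndarray) -> list:
    """Cut primers, standardize read-wise, then window."""
    return windows(standardize_readwise(cut_primers(read)))
```

Balanced sampling of the 1M positives vs 2M negatives then reduces to subsampling negative read indices without replacement (e.g. with np.random.choice) before windowing.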
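Confident-learning sketch. The confident-learning slides can be written out in a few lines of numpy, which makes the per-class thresholds and the off-diagonal ranking explicit. This is a sketch of the published idea with function names of our own choosing, not cleanlab's internals; in practice cleanlab's find_label_issues(labels=..., pred_probs=...) packages the same procedure, with pred_probs taken out-of-sample via k-fold as the slide notes.

```python
import numpy as np

def per_class_thresholds(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """t_j = average predicted probability of class j over examples labeled j."""
    k = pred_probs.shape[1]
    return np.array([pred_probs[labels == j, j].mean() for j in range(k)])

def confident_joint(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """cj[i, j] counts examples labeled i that we confidently count as class j,
    i.e. whose p(j) exceeds the per-class threshold t_j."""
    k = pred_probs.shape[1]
    t = per_class_thresholds(labels, pred_probs)
    cj = np.zeros((k, k), dtype=int)
    for i in range(k):
        for row in pred_probs[labels == i]:
            js = np.where(row >= t)[0]          # classes above their threshold
            if js.size:                          # ties go to the most probable one
                cj[i, js[np.argmax(row[js])]] += 1
    return cj

def label_issue_indices(labels: np.ndarray, pred_probs: np.ndarray) -> list:
    """Steps 1-3 from the slide. Step 1 (scale Q by n) is implicit here,
    because cj already holds counts rather than a distribution."""
    cj = confident_joint(labels, pred_probs)
    issues = []
    for i in range(cj.shape[0]):
        idx_i = np.where(labels == i)[0]
        for j in range(cj.shape[1]):
            if i == j or cj[i, j] == 0:
                continue
            # Step 2: the cj[i, j] examples labeled i with the highest p(j).
            top = idx_i[np.argsort(-pred_probs[idx_i, j])[:cj[i, j]]]
            issues.extend(top.tolist())          # Step 3: repeat off-diagonally
    return sorted(set(issues))
```

Run on out-of-sample predicted probabilities from k-fold cross-validation; the returned indices can then be dropped or relabeled before retraining.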
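Training sketch. A sketch of the training slide's per-module learning rates and LR warmup in PyTorch, assuming the fine-tuned model exposes the pretrained basecaller trunk as model.backbone and the new classification head as model.head (both attribute names and all hyperparameter values are illustrative).

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    backbone_lr: float = 1e-5,  # assumed: small LR for pretrained basecaller
                    head_lr: float = 1e-3,      # assumed: larger LR for the fresh head
                    warmup_steps: int = 1000):
    # One parameter group per module, each with its own learning rate.
    optimizer = torch.optim.AdamW([
        {"params": model.backbone.parameters(), "lr": backbone_lr},
        {"params": model.head.parameters(), "lr": head_lr},
    ])
    # Linear warmup: scale every group's LR from ~0 up to its base value,
    # then hold it constant; call scheduler.step() once per batch.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
    )
    return optimizer, scheduler
```

Keeping the basecaller's LR one or two orders of magnitude below the head's is a common way to avoid destroying pretrained features while the randomly initialized head is still producing large gradients.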