Detecting RNA modifications from nanopore signal + exercises
Vlastimil Martinek

Central dogma of molecular biology - refresher

Modifications
- Alter exposed DNA/RNA
- Change gene expression
- Can be a biological target
- Immune system
- Repair
- Multiple types of modifications
- Example: ACCGTC <=> ACʼCʼGTCʼ (primes mark modified bases)

ONT (Oxford Nanopore Technologies)
- Sequencing to obtain digital data
- Basecalling

Existing methods
- The raw signal carries modification information
- Need a control sequence
- Dependent on the basecaller

New approach
- Detect modifications without a reference
- Binary output: modified / not modified

Data
- 1D signal per sequence
- Variable length
- Two labels: modified / non-modified
- 1M positives, 2M negatives
- Preprocessing?

Data exploration and preparation

Preprocessing (see the sketch at the end)
- Cut primers
- Windowing
- Balanced sampling
- Read-wise standardization
- …?

Noisy data labels
- High false-positive rate among positive labels (50-70%)
- All negative labels are correct
- This caps the metrics: with balanced classes and a fraction f = 0.5-0.7 of wrong positive labels, even a perfect model agrees with only (2 - f)/2 of the labels, i.e. 65-75% apparent accuracy
- How to tackle this?

Label cleaning
Confident learning (Google, MIT, cleanlab)
- Input: noisy labels + predicted probabilities
- Predicts label mistakes on held-out data (k-fold cross-validation is needed to clean the whole dataset)

Estimating label issues from the joint distribution matrix Q of noisy labels ỹ and true labels y* (matrix figure omitted):
1. Multiply the joint distribution matrix by the number of examples. Assume 100 examples in our dataset; then, per the Q matrix, 10 images labeled dog are actually images of foxes.
2. Mark the 10 images labeled dog with the largest predicted probability of belonging to class fox as label issues.
3. Repeat for all off-diagonal entries of the matrix.

Confident learning
"The central idea is that when the predicted probability of an example is greater than a per-class-threshold, we confidently count that example as actually belonging to that threshold's class. The thresholds for each class are the average predicted probability of examples in that class." (See the numpy sketch at the end.)

Preliminary results: it works (positive labels are the ones flagged as issues)

Confident learning - CIFAR-10 accuracy benchmark (figure omitted)

Model brainstorming

Initial architecture
- 1D ResNet
- ~60% accuracy

Transfer learning with basecallers (DNA)
- Custom head

Training (see the PyTorch sketch at the end)
- Transfer learning with basecallers (DNA)
- Different learning rates per module
- LR warmup
- ~72% accuracy
- Input length matters (positives are shorter)

Future
- Transformer + limited attention
- RNA basecaller
- Interpretability + basecalling info
- Learning with cleaned labels
- Speedup: parallelization
- Experiment differences
- Use statistical modification-detection methods
- Base-wise labeling

Sources
- https://inno-forum.org/single-cell-rna-sequencing-technique-come-age/
- https://www.labclinics.com/2018/11/08/role-dna-methylation-disease/?lang=en
- https://www.yourgenome.org/facts/what-is-oxford-nanopore-technology-ont-sequencing/
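Preprocessing sketch. A minimal sketch of the preprocessing slide, assuming each read arrives as a variable-length 1D numpy array of raw current samples; the primer lengths, window size, and stride are illustrative placeholders, not values from the talk.

```python
import numpy as np

def cut_primers(signal: np.ndarray, head: int = 100, tail: int = 100) -> np.ndarray:
    """Trim assumed primer/adapter regions from both ends of the read."""
    return signal[head:len(signal) - tail]

def standardize_readwise(signal: np.ndarray) -> np.ndarray:
    """Z-score each read against its own mean/std (read-wise, not dataset-wide)."""
    return (signal - signal.mean()) / (signal.std() + 1e-8)

def windows(signal: np.ndarray, size: int = 1000, stride: int = 500) -> list:
    """Slice a variable-length read into fixed-size, possibly overlapping windows."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, stride)]

def preprocess(read: np.ndarray) -> list:
    """Cut primers, standardize read-wise, then window."""
    return windows(standardize_readwise(cut_primers(read)))
```

Balanced sampling of the 1M positives vs 2M negatives then reduces to subsampling negative read indices without replacement (e.g. with np.random.choice) before windowing.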
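Confident-learning sketch. The confident-learning slides can be written out in a few lines of numpy, which makes the per-class thresholds and the off-diagonal ranking explicit. This is a sketch of the published idea with function names of our own choosing, not cleanlab's internals; in practice cleanlab's find_label_issues(labels=..., pred_probs=...) packages the same procedure, with pred_probs taken out-of-sample via k-fold as the slide notes.

```python
import numpy as np

def per_class_thresholds(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """t_j = average predicted probability of class j over examples labeled j."""
    k = pred_probs.shape[1]
    return np.array([pred_probs[labels == j, j].mean() for j in range(k)])

def confident_joint(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """cj[i, j] counts examples labeled i that we confidently count as class j,
    i.e. whose p(j) exceeds the per-class threshold t_j."""
    k = pred_probs.shape[1]
    t = per_class_thresholds(labels, pred_probs)
    cj = np.zeros((k, k), dtype=int)
    for i in range(k):
        for row in pred_probs[labels == i]:
            js = np.where(row >= t)[0]          # classes above their threshold
            if js.size:                          # ties go to the most probable one
                cj[i, js[np.argmax(row[js])]] += 1
    return cj

def label_issue_indices(labels: np.ndarray, pred_probs: np.ndarray) -> list:
    """Steps 1-3 from the slide. Step 1 (scale Q by n) is implicit here,
    because cj already holds counts rather than a distribution."""
    cj = confident_joint(labels, pred_probs)
    issues = []
    for i in range(cj.shape[0]):
        idx_i = np.where(labels == i)[0]
        for j in range(cj.shape[1]):
            if i == j or cj[i, j] == 0:
                continue
            # Step 2: the cj[i, j] examples labeled i with the highest p(j).
            top = idx_i[np.argsort(-pred_probs[idx_i, j])[:cj[i, j]]]
            issues.extend(top.tolist())          # Step 3: repeat off-diagonally
    return sorted(set(issues))
```

Run on out-of-sample predicted probabilities from k-fold cross-validation; the returned indices can then be dropped or relabeled before retraining.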
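Training sketch. A sketch of the training slide's per-module learning rates and LR warmup in PyTorch, assuming the fine-tuned model exposes the pretrained basecaller trunk as model.backbone and the new classification head as model.head (both attribute names and all hyperparameter values are illustrative).

```python
import torch

def build_optimizer(model: torch.nn.Module,
                    backbone_lr: float = 1e-5,  # assumed: small LR for pretrained basecaller
                    head_lr: float = 1e-3,      # assumed: larger LR for the fresh head
                    warmup_steps: int = 1000):
    # One parameter group per module, each with its own learning rate.
    optimizer = torch.optim.AdamW([
        {"params": model.backbone.parameters(), "lr": backbone_lr},
        {"params": model.head.parameters(), "lr": head_lr},
    ])
    # Linear warmup: scale every group's LR from ~0 up to its base value,
    # then hold it constant; call scheduler.step() once per batch.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
    )
    return optimizer, scheduler
```

Keeping the basecaller's LR one or two orders of magnitude below the head's is a common way to avoid destroying pretrained features while the randomly initialized head is still producing large gradients.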